bcgsc / abyss
Assemble large genomes using short reads
http://www.bcgsc.ca/platform/bioinfo/software/abyss

Estimate resource requirements for genome assembly with Abyss v2.2.5 #445

Closed mesamuels closed 2 years ago

mesamuels commented 2 years ago

I want to assemble a genome from paired-end Illumina short reads (2x150). I realize this will be fairly fragmented; for my purposes that is OK for now. I just need to estimate the system resources to request via the Compute Canada Slurm scheduler, to give ABySS v2.2.5 a chance to finish before timing out or running out of memory. I am in Montreal and am working on Narval for the moment, although I could use another CC computer cluster if that is necessary.

The estimated size of the target genome is 1.4 Gb. I have a total of about 75 Gb of good sequence (adapters trimmed and quality trimmed with FastP), so about 50X coverage.

The genome of the plant in question is allotetraploid, meaning that there are two diploid subgenomes which are fairly similar. Thus there are four alleles at each unique site. However the degree of heterozygosity within each diploid subgenome is less than the amount of difference between the two subgenomes. I'm not worried about decisions regarding calling heterozygous sites within each subgenome, the goal is to minimize mis-assembly of reads originating in equivalent segments of the two subgenomes.

Thanks! Mark Samuels

vlad0x00 commented 2 years ago

An assembly for a genome that size should finish within a day if you give it 48 threads. In terms of memory usage, it would need about 25 GiB. If possible, I would also recommend using the latest ABySS version (2.3.4). Pass j=48 for 48 threads and B=20G for a 20 GiB Bloom filter (the assembler will likely need a bit of extra memory for other things, hence the ~25 GiB usage).
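As a rough sketch, the corresponding command might look like this (the read-file names are placeholders for your trimmed FASTQ files, and the k value is just an example; choosing k is discussed below):

    abyss-pe name=asm j=48 B=20G k=96 in='reads_1.fq.gz reads_2.fq.gz'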

The main parameters you'd want to optimize are the k-mer size k and the k-mer multiplicity kc. In the majority of cases, kc=2 works well, so if you're testing things out you might not need to care about optimizing it. If you do want to optimize, it's sufficient to check kc=2 and kc=3. For your read size and coverage, the correct k value is probably somewhere between 60 and 120, which you can sweep with a step of 10. Ideally you'd sweep a wider range (maybe 40-140 with a step of 5), but it depends how many resources you want to use. Each assembly will output an assembly-stats file -- look for the assembly with the highest N50 in there as your best one.
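A sweep can be scripted along the lines of the parameter-optimization example in the ABySS README (the directory layout and file names here are illustrative):

    # one assembly per k value, each in its own directory
    for k in $(seq 60 10 120); do
        mkdir -p k$k
        abyss-pe -C k$k name=asm j=48 B=20G k=$k kc=2 in='../reads_1.fq.gz ../reads_2.fq.gz'
    done
    # compare contiguity metrics (e.g. N50) across the sweep
    abyss-fac k*/asm-contigs.fa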

After you find the right k and kc, I'd then try optimizing PopBubbles as per our email exchange. You'd want to make sure it doesn't collapse paths from different subgenomes, so trying out different values for the --identity parameter (0.9 by default) might make sense. You pass parameters to PopBubbles specifically by exporting the POPBUBBLES_OPTIONS variable before the assembly, e.g.:

export POPBUBBLES_OPTIONS="--identity 0.95"

mesamuels commented 2 years ago

Thanks Vlad, I will incorporate this information. I'm a little unclear on what "threads" are versus "CPUs". Should I use the G flag to tell it my estimated genome size, or will it estimate that itself? The memory requirement is quite modest here; a lot of assemblers say you need something like 500 gigabytes or even more in total memory allocation. With a one-day run time, it seems reasonable to try a variety of parameters.

I understand that a big N50 number is normally better, but given the polyploidy I'll be cautious about this optimization. From what is already known about the quinoa genome (closely related), there should ideally be two contigs for most unique genes, one for each subgenome, easily distinguished by multiple differences. If there is only one contig for many unique genes in the assembly, that would imply that the two subgenomic copies were merged incorrectly. So I may have to try k/kc parameters that don't give the longest N50, if they end up giving better separation of the two subgenomes for most genes.

I'll let you know how it goes. Obviously any publication coming out of this work would cite your publications.

Regards! Mark

vlad0x00 commented 2 years ago

The genome size is currently not calculated automatically, so passing the G parameter is helpful for the summary statistics; e.g., it gives you the assembly NG50 in addition to the N50.

The modest memory requirement is one of the advantages of using ABySS :) especially if you need to run a large number of assemblies either to optimize parameters or for different runs.

As for looking for the highest N50, you raise a good point. Would BUSCO metrics perhaps be more helpful here? They give you the number of genes found, including ones found in duplicates.

mesamuels commented 2 years ago

Thanks Vlad, I will include that -G flag. Memory usage is very impressive indeed!

For monitoring assembly, BUSCO sounds like a good thing to try. I've read about it but have to learn how to implement it. I was thinking in terms of just blasting the assembly contigs with some unique genes (for example from the quinoa genome assembly). Ideally there would only be two contigs per gene (or gene segment, probably many genes will be fragmented to multiple contigs due to repetitive elements in the introns).

mesamuels commented 2 years ago

Oops, sorry, one more question. Am I correct that I should do the quality and adapter trimming before running ABySS? I just want to make sure I understand the instructions correctly. Some of the other assemblers (Platanus and MaSuRCA) do it themselves, so they say not to pre-trim.

I've used FastP, which is easier to implement than Trimmomatic and works better (according to the creators!), with a fairly stringent quality metric that removed about a quarter of all reads. Given the tetraploid genome, it seems better to lose some reads than to include dubious calls that might confuse the contigs. Even with that, I still have about 50X coverage.

Thanks! Mark

vlad0x00 commented 2 years ago

Hi Mark,

ABySS doesn't do adapter trimming itself. We've used Trimmomatic in the past and found that it is not a bad idea to run it, but it is not essential to the assembly, as the adapter sequences would end up as separate components in the assembly graph anyway. 50X is good coverage, so I think your assembly shouldn't suffer too much from the loss of reads. Ideally, though, I would do an assembly with and without trimming and compare the qualities, but that of course might be computationally prohibitive in your case.

mmokrejs commented 2 years ago

Hi Mark, I have not tested FastP myself, but Trimmomatic works rather well for standard Illumina adapters. I did have to increase its sensitivity to search for parts of adapters (from incompletely sequenced adapters), though. But it is odd that 1/4 of your raw data was trimmed away. The PE distance (insert size) should have been between 350 and 550 nt (larger is better). Typically, in most reads of, say, 150 nt you never reach the adapter at all, so I am surprised that 1/4 of the reads would not only contain adapter but also be left empty, or only a few dozen nucleotides long, after trimming. That would point to a wet-lab issue.

I would not do the quality-based trimming, or only very gently. I would rather do k-mer-based error correction to make life easier for an assembler like ABySS; that way you preserve the coverage. The Bloom filter in ABySS discards erroneous k-mers, so it does practically the same thing. For a tetraploid or recent hybrid genome, 50x average coverage is not overly high, so you may still face errors in protein-coding regions. It sounds like you may have some additional PacBio or Oxford Nanopore data too. I hope the 50x coverage is based on the Illumina data alone; otherwise you may not have enough in each group.

mesamuels commented 2 years ago

Thanks guys, that's very helpful. As a test, I ran FastQC before and after running FastP. Before FastP, it calls a lot of adapter sequences, at the 3' ends of reads as expected. After running FastP, FastQC found no presumptive adapter sequences. So FastP seems good. I used very stringent cutoffs for read and base quality, requiring an average Phred score of 30 and maximum 15% of bases below 30. With lower Phred score thresholds, a lot more reads make it through so I can certainly add coverage by reducing stringency. Also, k-mer analysis of the cleaned dataset using either 25-mers or 50-mers gave predicted genome sizes of 800-900Mb, with a k-mer coverage shoulder at about twice the main peak. That suggests that at those k-mer sizes, some reads from the two subgenomes were counted together, reducing the total predicted genome size from 1.4Gb, and causing the secondary peak at twice the main coverage peak.

According to FastP, the insert-size peak in the paired-end library appears to be around 200 bp, and the software said only 23% of reads had no overlap between the ends. So the library insert size seems to be a bit smaller than ideal; I will ask the lab about that. But that reduces the effective coverage as well, and I presume limits the ability of the assembler to disentangle similar repetitive elements. The quinoa genome has a lot of Copia transposons, so presumably my plant does also.

If the assembly really runs in just a day, I can try it with various input data sets using different stringencies. I like the idea of at least trimming the adapters before starting. I assume that the reads themselves are retained by FastP, just the adapter sequences removed from the 3' ends.

No, there is no long read data, just the Illumina paired-ends. This was a pilot experiment, done with a little bit of slush fund money, not dedicated funding. If I can see the two subgenomic copies of each unique gene in different contigs (even if genes are split among multiple contigs due to repeats in the introns), that will provide useful data for requesting additional funding.

mesamuels commented 2 years ago

A general question about requesting resources on the CC system. I'm unclear on how independent the values are that one can request for CPUs, threads, and memory per CPU. To get the suggested 48 threads, I guess I need to request 24 CPUs. With a maximum of 4 GB memory per CPU, that would be 96 GB of memory in total, way more than you suggest I need. Can I request less memory per CPU, such as 2 GB, or does that automatically reduce the maximum number of threads I could have from 2 per CPU to 1, which would not be good? This kind of explicit resource allocation to run software is new to me, so I am just learning.

Thanks!

mmokrejs commented 2 years ago

I would give up on the quality-based trimming and just do error correction of the adapter-trimmed reads instead. Look at BBMap; it has several utilities for this job. Do not merge read pairs into a single piece; leave that up to the assembler to decide. It is unbelievably fast and efficient.
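For instance, BBMap's Tadpole can do the k-mer-based error correction; a minimal sketch, assuming adapter-trimmed input files with these hypothetical names:

    # error-correct the read pairs without merging them
    tadpole.sh in1=R1.trimmed.fq.gz in2=R2.trimmed.fq.gz \
        out1=R1.ecc.fq.gz out2=R2.ecc.fq.gz mode=correct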

Ask the sequencing provider for a trace from an Agilent Bioanalyzer or TapeStation to see what the size of the DNA fragments was before and after ligation of the sequencing adapters. But basically you already know what you will see.

ABySS does not support MPI (anymore?), so ask for a single machine with the largest number of CPUs.

mesamuels commented 2 years ago

Thanks very much, will do!

vlad0x00 commented 2 years ago

> A general question about requesting resources on the CC system. I'm unclear on how independent the values are that one can request for CPUs, threads, and memory per CPU. To get the suggested 48 threads, I guess I need to request 24 CPUs. With a maximum of 4 GB memory per CPU, that would be 96 GB of memory in total, way more than you suggest I need. Can I request less memory per CPU, such as 2 GB, or does that automatically reduce the maximum number of threads I could have from 2 per CPU to 1, which would not be good? This kind of explicit resource allocation to run software is new to me, so I am just learning.
>
> Thanks!

I'm looking at the SLURM documentation from Compute Canada (https://docs.computecanada.ca/wiki/Running_jobs#Threaded_or_OpenMP_job) and in their case, you'd want to specify --cpus-per-task=48 to use 48 threads.

For memory, you don't need to specify it per CPU, you can specify total memory with --mem=30G. I'd give each assembly job 30G just in case so it doesn't crash due to insufficient memory.
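Putting that together, a job script might look something like the sketch below (the module name and read-file names are assumptions; check what your cluster provides):

    #!/bin/bash
    #SBATCH --cpus-per-task=48
    #SBATCH --mem=30G
    #SBATCH --time=24:00:00
    module load abyss   # module name may differ on your cluster
    abyss-pe name=asm j=48 k=96 kc=2 B=20G in='reads_1.fq.gz reads_2.fq.gz'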

mesamuels commented 2 years ago

Vlad, sorry to bug you, but there's an inconsistency between the memory allocation syntax you suggest and what my colleague here says. He says that if I request --mem=30G, it will apparently restrict the number of CPUs accordingly, so I will end up with fewer threads. Supposedly each CPU takes a minimum of 3.9 GB, so 30 GB total won't get distributed equally over 48 CPUs; it'll just take something like 8, giving only 8 threads. If so, then there's no way to restrict the total memory to the smaller amount really required; I have to allocate at least 3.9 GB x 48 = 187 GB, even if the program won't use it.

One more syntax thing: I can't find a description of the units for the -G genome size switch. Should it be -G=1.4G for 1.4 gigabases? I tried that with a quick interactive run just to see if the program would accept it, and it did. But then I tried setting -G=1.4R, which I would think is meaningless, but the program still started to run. What is the exact syntax for that switch?

Thanks! Mark

vlad0x00 commented 2 years ago

> Vlad, sorry to bug you, but there's an inconsistency between the memory allocation syntax you suggest and what my colleague here says. He says that if I request --mem=30G, it will apparently restrict the number of CPUs accordingly, so I will end up with fewer threads. Supposedly each CPU takes a minimum of 3.9 GB, so 30 GB total won't get distributed equally over 48 CPUs; it'll just take something like 8, giving only 8 threads. If so, then there's no way to restrict the total memory to the smaller amount really required; I have to allocate at least 3.9 GB x 48 = 187 GB, even if the program won't use it.

Hmm, I haven't used Compute Canada's Slurm scheduler, so I'm not 100% sure, but I haven't heard about CPUs requiring minimum memory. Does your colleague know if there's any documentation on this that I can look up? If you want to be safe, you can run ABySS with fewer threads, e.g. 24. It might take two days, but hopefully that's not a big deal, especially if you run multiple jobs in parallel to optimize the parameters.

> One more syntax thing: I can't find a description of the units for the -G genome size switch. Should it be -G=1.4G for 1.4 gigabases? I tried that with a quick interactive run just to see if the program would accept it, and it did. But then I tried setting -G=1.4R, which I would think is meaningless, but the program still started to run. What is the exact syntax for that switch?

The abyss-pe script takes parameters using the param=value syntax, and the G parameter currently doesn't support metric prefixes, so you'd specify it as G=1400000000.

mesamuels commented 2 years ago

Thanks Vlad! I tried a quick test run to check the parameter syntax, requesting just a few CPUs for one hour, and also my Slurm batch script syntax. All seemed OK: it ran for an hour and timed out, with no output except an empty .fa file. I am now trying the real run with 48 CPUs and 186 GB of RAM, and Slurm started the job almost immediately, so I guess that's still not a lot of demand on the system. Anyway, FYI here is the command line:

abyss-pe name=Cber1 j=48 k=96 kc=2 B=50G G=1400000000 in='../R1.cleanmin50.fastq.gz ../R2.cleanmin50.fastq.gz'

We'll see! Best, Mark

p.s. the output name is based on the plant name, which is Chenopodium berlandieri. Quinoa officially is Chenopodium quinoa, and this is a native Canadian sister species. "Cheno-podium" is from the Greek for "goose foot", from the shape of the leaves.

vlad0x00 commented 2 years ago

The command looks good! Let me know if it gives you any issues during the run.

mesamuels commented 2 years ago

Thanks Vlad. So far it seems to be running OK. It has created a bunch of big .fa files plus some others, some of which seem to contain mini-contigs. I blasted a few of those against the online quinoa genome database and they get hits, which is not surprising since I get the same thing when I blast individual short reads against the database. Some of the mini-contigs are rather large; those are probably repetitive elements at this point.

Maybe this is obvious, but I presume it will give some clear report when it's done, so I know I can release the resource allocation on CC if there is still a lot of time left.

Mark

vlad0x00 commented 2 years ago

Once the computing job is done, the resources should be immediately deallocated in Slurm. You can check which jobs are still running with squeue -u $USER (from https://docs.computecanada.ca/wiki/Running_jobs#Current_jobs). Once that's empty, you know all computing jobs are done. For ABySS specifically, once the Cber1-stats file is available in the assembly directory, the assembly is done. At that point, you can find the final sequences in Cber1-scaffolds.fa.

mesamuels commented 2 years ago

Great, thanks. A minor system question: it's running at the moment, but when I search for processes under my username (samuelsm), either with ps -u or top -u, I don't see anything with a very large memory allocation, nor any process with a name that seems Abyss-related. Is it somehow all running somewhere else in the system, so it doesn't report those processes to me? I see the job with squeue, but that doesn't report the memory usage of the processes, just the time remaining for the job. Or is it just not actually using much memory? Just curious; I am learning how to check the system state.

Mark

vlad0x00 commented 2 years ago

Submitting jobs to Slurm runs them on separate machines dedicated to cluster computing. That's why you're not going to see anything about the processes that are running: they're on different machines! squeue can tell you the name of the machine a job runs on, and sometimes it lets you ssh into that machine, after which you can run top or ps to see what's going on, but this is almost never necessary. squeue lets you print whatever information about each job you want (although it only prints some of the information by default), including memory. According to https://slurm.schedmd.com/squeue.html, it seems you'd need to pass a format string with "%m" in it, but that page has more information if you're interested in exactly how.
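For example, something along these lines should work (the field widths are arbitrary; %m prints the minimum memory requested by each job):

    squeue -u $USER -o "%.18i %.9P %.8j %.8T %.10M %m"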

mmokrejs commented 2 years ago

You can safely ask for a longer compute time; when the job exits, the compute node is freed for another job.

You can try passing https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/63459/100/GCF_001683475.1_ASM168347v1/GCF_001683475.1_ASM168347v1_rna.fna.gz (from https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/63459/100/GCF_001683475.1_ASM168347v1/) to ABySS as a transcriptome assembly to yield a long-scaffs file. You will get an additional layer of scaffolding.

mesamuels commented 2 years ago

Thanks for confirming about the memory. Martin, I see that those links are to the quinoa genome, which this is similar to. But I don't understand what you mean by 'passing' in terms of the commands to ABySS.

The run finished overnight; based on the times at which the output files were created, it seems to have taken about 9 hours. I guess it finished normally; anyway, there are lots of output files in the directory. Do I need to keep all of them, or just the 'last ones'? It's not clear which those are. I see the files named unitigs, contigs and scaffolds, but those seem to be pointers to other files with numbered names. Here are all the files in the directory:

abyss_run.sh  Cber1-2.dot   Cber1-3.dist    Cber1-3.hist    Cber1-4.path1  Cber1-5.fa        Cber1-6.fa        Cber1-7.dot  Cber1-8.fa         Cber1-scaffolds.dot  Cber1-stats.html  slurm-2353276.out
Cber1-1.dot   Cber1-2.dot1  Cber1-3.dot     Cber1-4.dot     Cber1-4.path2  Cber1-5.path      Cber1-6.hist      Cber1-7.fa   Cber1-contigs.dot  Cber1-scaffolds.fa   Cber1-stats.md
Cber1-1.fa    Cber1-2.fa    Cber1-3.fa      Cber1-4.fa      Cber1-4.path3  Cber1-6.dist.dot  Cber1-6.path      Cber1-7.path Cber1-contigs.fa   Cber1-stats          Cber1-stats.tab
Cber1-1.path  Cber1-2.path  Cber1-3.fa.fai  Cber1-4.fa.fai  Cber1-5.dot    Cber1-6.dot       Cber1-6.path.dot  Cber1-8.dot  Cber1-indel.fa     Cber1-stats.csv      Cber1-unitigs.fa

Here is the stats.tab file:

    n        n:500   L50     LG50    NG50  min  N75   N50   N25   E-size  max    sum      name
    2943789  527664  105387  301097  905   500  1137  2239  4166  3123    37169  850.7e6  Cber1-unitigs.fa
    2717425  466637  85438   234703  1061  500  1415  2859  5178  3829    38261  863.3e6  Cber1-contigs.fa
    2713092  463398  83561   231639  1060  500  1422  2913  5296  3908    41438  863.1e6  Cber1-scaffolds.fa

It looks like it didn't resolve the two subgenomes, or maybe lots of repetitive elements, given that the assembled genome size was about 860 Mbp when it should be more like 1.4 Gbp. The slurm .out file includes some suggestions, plus the ones you gave me previously about bubble popping and identity. I will try again after absorbing these results a bit.

Mark

mesamuels commented 2 years ago

Here is the list of output files with sizes:

-rwxr-x---. 1 samuelsm def-samuelsm        461 Feb 3 17:13 abyss_run.sh
-rw-r-----. 1 samuelsm def-samuelsm  632841392 Feb 3 18:22 Cber1-1.dot
-rw-r-----. 1 samuelsm def-samuelsm 1939169937 Feb 3 18:21 Cber1-1.fa
-rw-r-----. 1 samuelsm def-samuelsm          0 Feb 3 18:22 Cber1-1.path
-rw-r-----. 1 samuelsm def-samuelsm  446822726 Feb 3 18:24 Cber1-2.dot
-rw-r-----. 1 samuelsm def-samuelsm  446822726 Feb 3 18:24 Cber1-2.dot1
-rw-r-----. 1 samuelsm def-samuelsm 1543924830 Feb 3 18:24 Cber1-2.fa
-rw-r-----. 1 samuelsm def-samuelsm    2355105 Feb 3 18:34 Cber1-2.path
-rw-r-----. 1 samuelsm def-samuelsm    5775037 Feb 3 22:13 Cber1-3.dist
-rw-r-----. 1 samuelsm def-samuelsm  423801080 Feb 3 18:34 Cber1-3.dot
-rw-r-----. 1 samuelsm def-samuelsm 1495184540 Feb 3 18:34 Cber1-3.fa
-rw-r-----. 1 samuelsm def-samuelsm   88487296 Feb 3 22:14 Cber1-3.fa.fai
-rw-r-----. 1 samuelsm def-samuelsm      12127 Feb 3 22:03 Cber1-3.hist
-rw-r-----. 1 samuelsm def-samuelsm  423861468 Feb 3 22:13 Cber1-4.dot
-rw-r-----. 1 samuelsm def-samuelsm      43101 Feb 3 22:13 Cber1-4.fa
-rw-r-----. 1 samuelsm def-samuelsm       4864 Feb 3 22:14 Cber1-4.fa.fai
-rw-r-----. 1 samuelsm def-samuelsm    4810559 Feb 3 22:14 Cber1-4.path1
-rw-r-----. 1 samuelsm def-samuelsm    3962804 Feb 3 22:14 Cber1-4.path2
-rw-r-----. 1 samuelsm def-samuelsm    3919592 Feb 3 22:14 Cber1-4.path3
-rw-r-----. 1 samuelsm def-samuelsm  423862698 Feb 3 22:15 Cber1-5.dot
-rw-r-----. 1 samuelsm def-samuelsm       1447 Feb 3 22:15 Cber1-5.fa
-rw-r-----. 1 samuelsm def-samuelsm    3919708 Feb 3 22:15 Cber1-5.path
-rw-r-----. 1 samuelsm def-samuelsm    2340074 Feb 4 01:54 Cber1-6.dist.dot
-rw-r-----. 1 samuelsm def-samuelsm  376764126 Feb 3 22:16 Cber1-6.dot
-rw-r-----. 1 samuelsm def-samuelsm 1464479538 Feb 3 22:16 Cber1-6.fa
-rw-r-----. 1 samuelsm def-samuelsm      19840 Feb 4 01:44 Cber1-6.hist
-rw-r-----. 1 samuelsm def-samuelsm      91859 Feb 4 01:55 Cber1-6.path
-rw-r-----. 1 samuelsm def-samuelsm    4296066 Feb 4 01:55 Cber1-6.path.dot
-rw-r-----. 1 samuelsm def-samuelsm  376764492 Feb 4 01:55 Cber1-7.dot
-rw-r-----. 1 samuelsm def-samuelsm        405 Feb 4 01:55 Cber1-7.fa
-rw-r-----. 1 samuelsm def-samuelsm      97867 Feb 4 01:55 Cber1-7.path
-rw-r-----. 1 samuelsm def-samuelsm  376017050 Feb 4 01:57 Cber1-8.dot
-rw-r-----. 1 samuelsm def-samuelsm 1464053555 Feb 4 01:56 Cber1-8.fa
lrwxrwxrwx. 1 samuelsm def-samuelsm         11 Feb 3 22:16 Cber1-contigs.dot -> Cber1-6.dot
lrwxrwxrwx. 1 samuelsm def-samuelsm         10 Feb 3 22:16 Cber1-contigs.fa -> Cber1-6.fa
-rw-r-----. 1 samuelsm def-samuelsm   26564334 Feb 3 18:34 Cber1-indel.fa
lrwxrwxrwx. 1 samuelsm def-samuelsm         11 Feb 4 01:57 Cber1-scaffolds.dot -> Cber1-8.dot
lrwxrwxrwx. 1 samuelsm def-samuelsm         10 Feb 4 01:56 Cber1-scaffolds.fa -> Cber1-8.fa
lrwxrwxrwx. 1 samuelsm def-samuelsm         15 Feb 4 01:57 Cber1-stats -> Cber1-stats.tab
-rw-r-----. 1 samuelsm def-samuelsm        324 Feb 4 01:57 Cber1-stats.csv
-rw-r-----. 1 samuelsm def-samuelsm       1228 Feb 4 01:57 Cber1-stats.html
-rw-r-----. 1 samuelsm def-samuelsm        552 Feb 4 01:57 Cber1-stats.md
-rw-r-----. 1 samuelsm def-samuelsm        324 Feb 4 01:57 Cber1-stats.tab
lrwxrwxrwx. 1 samuelsm def-samuelsm         10 Feb 3 18:34 Cber1-unitigs.fa -> Cber1-3.fa
-rw-r-----. 1 samuelsm def-samuelsm       8070 Feb 4 01:57 slurm-2353276.out

mmokrejs commented 2 years ago

See https://github.com/bcgsc/abyss#rescaffolding-with-long-sequences. Basically, rerun exactly the same command as you did, just append to it:

long='longa' longa='GCF_001683475.1_ASM168347v1_rna.fna'

You will get Cber1-long-scaffs.fa

mesamuels commented 2 years ago

Thanks Martin, I will try that, maybe after I try a few other parameter changes first to get longer contigs and better separate the two subgenomes.

If I understand the output, the final contigs file is Cber1-6.fa, and the final scaffolds file (which is not much different) is Cber1-8.fa. Can I delete all the other files from the earlier passes (1-5 and 7)? Cber1-7.fa is small and looks like just a few reads with very simple sequences, presumably not worth keeping. Is there any reason to keep the .dot and .path files? Note that I will be repeating the run with different parameters, so I only need to keep what is useful for comparing the resulting assemblies.

Thanks, Mark

mmokrejs commented 2 years ago

Do it now before you delete the other files. Otherwise they will be re-computed. ;-)

Copy away the files mentioned in stats.tab, including the eventual Cber1-long-scaffs.fa. Actually, beware: those are symlinks, so keep the target files too.

It would be easier to symlink the input files from a different directory, though.

mesamuels commented 2 years ago

Aha, I tried putting the longa parameters on a new line and ABySS didn't see them; it stopped, saying nothing needed to be done. So I added them to the actual abyss-pe command line itself, and it started running, but then stopped because it did not recognize the bwa command. I had to load that module, which was simple, and now it seems to be running. Thanks!

Mark

mmokrejs commented 2 years ago

Hi Mark, ABySS is based on the GNU make tool, so it will detect that the output files are already present and go directly to the long-scaffolding step. It will just run the bwa mapper and do the scaffolding; that scales well and will be rather quick. That is why you need to do it now, before you delete the files; otherwise they would be recalculated. After this completes, you can start deleting some of the unneeded files as I mentioned before. It will just create a few extra files; the current results will remain there. It will probably overwrite the Cber1-stats.tab and Cber1-stats.md files, or (more likely) just append a single line to them.

So, add it to the same script you have. In principle, you could copy the script under a new name and point the Slurm job to the modified file, but that is just more work for you. The idea is simply not to alter the other parameters, so that ABySS will not find any differences in the runtime parameters.
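Concretely, combining your earlier command line with the two extra parameters, the rerun would look like this (assuming the transcript file sits in the assembly directory):

    abyss-pe name=Cber1 j=48 k=96 kc=2 B=50G G=1400000000 \
        in='../R1.cleanmin50.fastq.gz ../R2.cleanmin50.fastq.gz' \
        long='longa' longa='GCF_001683475.1_ASM168347v1_rna.fna'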

mesamuels commented 2 years ago

Thanks Martin, that's very clever of the software (developers). I will try that. I think I'll try improving the assembly itself first. I used a very stringent quality filter on the raw fastq read files, which was unnecessary as you and Vlad noted. I will try again, just removing the adapters, and also the poly-G runs that apparently tend to happen with the NovaSeq version of Illumina sequencing (only two colors used, not four). But when I get as far as I can with the main assembly, then I will do the longa step using the quinoa gene models from NCBI.

Many, many thanks for all your assistance. I looked into a few other assemblers (MaSuRCA, Platanus); similarly, their papers and online instructions need clarification, but there is no communication back from the developers, so it's hard to actually implement them.

mesamuels commented 2 years ago

BTW, the most recent version of ABySS on Compute Canada seems to be v2.2.5. It sounds like there is a more recent one. I have yet to learn how to install software myself, but since this is a general tool, it would be great if you guys updated it to the most recent version on the main system!

mmokrejs commented 2 years ago

But this long-scaffolding step will have the highest impact, believe me. It makes no sense not to do this quite lightweight step now; you can be done with that work quickly. By the way, I am not an ABySS developer, and I am based in Europe, so I have nothing to do with your cluster systems. ;)

mesamuels commented 2 years ago

Aha, thanks Martin for the advice. I did not realize you were not in Vancouver; all I see is your username, and if I click on it, your full name. Maybe the guys there can update the software. I will try your suggestion with the current assembly. Do I need the same number of threads for this last step (48)? The way our system is set up, to get that number of threads I have to request that number of CPUs, which comes with almost four times as much memory as the program actually uses.

Mark

mmokrejs commented 2 years ago

The read-mapping step scales well, so any thread count will do, and the very last step, writing out the scaffolds, takes (if I remember right) just a single thread. It does not matter which way you go; changing this parameter will not trigger recomputation of the whole thing. I would re-run it as it was; only if there is a long waiting queue would I ask for fewer CPU threads, probably in multiples of 8, as you wish.

mesamuels commented 2 years ago

OK, thanks Martin. I'm having a bit of trouble with the syntax: the program is stopping before it gets to the longa line of the script, saying there is nothing it needs to do. Should it be on a separate line, or on the same line as the main abyss-pe command, just with a space after the previous parameters?

Mark

mmokrejs commented 2 years ago

I would say yes, really just append it to the abyss-pe line, after a space.

If that does not work, try putting the first part, long='longa', in the same order as in this example mentioned already above:

abyss-pe k=96 B=2G name=ecoli lib='pe1 pe2' mp='mp1 mp2' long='longa' \
    pe1='pe1_1.fa pe1_2.fa' pe2='pe2_1.fa pe2_2.fa' \
    mp1='mp1_1.fa mp1_2.fa' mp2='mp2_1.fa mp2_2.fa' \
    longa='longa.fa'

You can remove the backslashes and newlines and keep it all on a single line. You can safely try it on the command line, press ctrl+c once abyss starts working properly, and then submit it as a job.

mmokrejs commented 2 years ago

Did you run gzip -d GCF_001683475.1_ASM168347v1_rna.fna.gz after downloading it?

mesamuels commented 2 years ago

Hmm, I didn't download anything; I just added the text "long='longa' longa='GCF_001683475.1_ASM168347v1_rna.fna'" to the end of the abyss-pe command line. It seems to be running, but maybe it's just doing the BWA step on the final scaffolds, and will give an error when it does not see that file in the directory? I will download the file, unzip it, and try again!

Mark

mmokrejs commented 2 years ago

So run this on the command line before starting the job:

wget https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/63459/100/GCF_001683475.1_ASM168347v1/GCF_001683475.1_ASM168347v1_rna.fna.gz

gzip -d GCF_001683475.1_ASM168347v1_rna.fna.gz

Should output no errors.

mesamuels commented 2 years ago

Thanks, I already downloaded it to my home computer (where I am, still waiting 14 days after my third, i.e. booster, vaccination before going back to work, as I work at the hospital research center and there are multiple cases elsewhere in the building). Then I transferred it. But I did not know the commands to get files directly from NCBI with that wget command, so thanks! I will let you know how it goes.

Mark

mmokrejs commented 2 years ago

You could fetch it straight to the server into the directory with abyss data. Good luck then! I keep my fingers crossed.

mesamuels commented 2 years ago

Yes now I know how to do that. I've started the request, but our scheduler has not started the run yet.

Mark

mesamuels commented 2 years ago

The run is going.

mesamuels commented 2 years ago

Hah, I went out to do an errand and when I came back it was done already; the scheduler says it took only 22 minutes of real time. Now to understand the output. I see a new fasta file named project-long-scaffs.fa. However, when I scroll through it (just a little), I don't seem to see super-long contigs. Is this file still the original assembly contigs, or are there bigger contigs further in than I looked? There is also a sam.gz file; is that where the new scaffolds actually are? Here are the stats from the original contigs, scaffolds, and long scaffolds. Did much happen?

    n        n:500   L50    LG50    NG50  min  N75   N50   N25   E-size  max    sum      name
    2717425  466637  85438  234703  1061  500  1415  2859  5178  3829    38261  863.3e6  Cber1-contigs.fa
    2713092  463398  83561  231639  1060  500  1422  2913  5296  3908    41438  863.1e6  Cber1-scaffolds.fa
    2673354  427706  60777  197009  1060  500  1492  3584  7518  5503    55607  862e6    Cber1-long-scaffs.fa

mmokrejs commented 2 years ago

I told you it is quick. It did merge some of the scaffolds further. Not that much, but that can be attributed to the difference between the two species. The new scaffolds are in the Cber1-long-scaffs.fa file. That is the file to work with.

The sam.gz file contains the predicted mRNA parts mapped onto the Cber1-scaffolds.fa pieces. If you feel brave, you can try to figure out how many mRNAs did NOT get mapped to the Cber1-scaffolds.fa contents due to too many discrepancies. In theory, you can tweak bwa mem to allow for more mismatches. You can study the unaligned reads in the sam.gz file; they have * in some columns. Or better, follow this manual and use samtools to do the filtering: http://www.novocraft.com/documentation/novoalign-2/novoalign-ngs-quick-start-tutorial/1040-2/. This way you find transcripts which are too dissimilar between the species, or whose exons are somehow not represented in your assembly. You can convert them back to FASTA format and run e.g. blastn against your Cber1-long-scaffs.fa to see visually the supposedly many differences causing the mRNA pieces to be unaligned.

Here is a similar manual: https://kb.10xgenomics.com/hc/en-us/articles/360004689632-How-do-I-identify-the-unmapped-reads-in-my-Cell-Ranger-or-Long-Ranger-output-
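A minimal sketch of that filtering with samtools (the SAM file name is an assumption; FLAG bit 4 marks unmapped records):

    samtools view -c -f 4 Cber1-longa.sam.gz                   # count the unmapped transcript records
    samtools fasta -f 4 Cber1-longa.sam.gz > unmapped_mrna.fa  # convert them back to FASTA for blastn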

mesamuels commented 2 years ago

Thanks Martin, I will look into this. If I understand, the long-scaffs.fa file should be improved over the scaffolds.fa file, though in this case it did not seem to make a big difference. I was going to try blasting these files with quinoa gene sequences anyway, to see how many different contigs/scaffolds a given gene is spread out over. However, in this case there is a bigger issue to deal with first, namely that the total assembly is still just a bit more than half the estimated genome size. Since there are two very similar subgenomes (the species is an allotetraploid hybrid arising from two similar diploids), I guess the two copies of many unique regions are merged into single contigs. I'd prefer to split these correctly first, before adding the transcript step. The differences between my species (Chenopodium berlandieri) and quinoa (Chenopodium quinoa) are almost as great as those between the two subgenomes within each species. So using the quinoa transcripts to superscaffold the berlandieri contigs may just be confusing the system, again more parameters to play with.

I realize that all of this is suboptimal, what I really should have are long-reads. However as I mentioned this is just a pilot, done with a bit of money, enough for the Illumina paired-end run which is a lot cheaper than PacBio or OxfordNano even now. I'm just trying to get as much out of the sequence I have as possible, I recognize that there are true limitations on the potential assembly due to all the repetitive sequences (these species have lots of copia elements apparently) plus the two similar subgenomes. However, if I can get funded to do additional sequencing with long reads and HiC scaffolding, all the learning I'm doing here will come in very useful indeed.

Meanwhile, I think this issue can be closed, although I want to copy out all the various comments first; I'm not sure if I can do that or only the development team can. I've opened a new issue just to ask whether they can load the newest version of ABySS here on the Canadian supercomputer system, since they are in Canada. If not, that's another thing to learn!

Best of luck in your research. I see your latest paper is on hybridization and polyploidy, in fish species as a model. This is a big deal in plants of course. In the case of quinoa, not a lot of classical genetics has been done, mostly it's been field crosses. But in a few cases of specific mutations with phenotypes, the genetics behaves as if the plant is diploid, so at least for those genes, although there are 4 'haploid' copies, the two subgenomic copies do not appear fully redundant. But that may depend on the gene. I did some work with zebrafish, we made a morpholino knockdown for a gene we had found mutated in human patients. There were still two recognizable copies in the fish genome (2 haploid copies that is, so 4 total). We had to inject morpholinos targeting both to see a phenotype in the fish, which we did and it was consistent with the human genetics.

Cheers! Mark

mmokrejs commented 2 years ago

Hi Mark, I understand your concerns, and thank you for your kind explanation of your intent. The long scaffolds are not too great, but you have to realize that the missing 550 Mb is supposedly just repeats. Assemblers assemble the non-repetitive part first (the protein-coding genes, typically). So you have their exons, and what is typically missing are the repeats needed to fill the gaps. That will not help you with your work, I think.

I think it would help you to download Cber1-contigs.fa, Cber1-scaffolds.fa and Cber1-long-scaffs.fa to your desktop, along with the sam.gz file. Load them into the IGV browser and inspect them. This is a somewhat twisted analysis, so read further below.

What would be more interesting is to map the Cber1 genomic reads onto the quinoa transcripts in GCF_001683475.1_ASM168347v1_rna.fna, say using bwa mem. The SAM or BAM file with the results you can again download to your desktop and open in IGV: load the mRNA sequences as the reference sequence and check for which genes you can see the hoped-for 4 alleles. For this type of work you did not even need to do a genome assembly; just map the genomic reads onto the predicted mRNAs from quinoa. bwa is a bit over-simplified for this too; better would be bowtie2, STAR or hisat2, which allow for spliced reads. But I think that is not so imperative, because your reads are short and therefore likely to cover just a single exon. Some will be unmapped or softclipped, but you will get some results out.
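A sketch of that mapping, reusing the read files from your abyss-pe command above (the output name is a placeholder, and this assumes bwa and samtools are available on the node):

    # index the quinoa transcripts, map the genomic read pairs, then sort and index the result
    bwa index GCF_001683475.1_ASM168347v1_rna.fna
    bwa mem -t 48 GCF_001683475.1_ASM168347v1_rna.fna \
        ../R1.cleanmin50.fastq.gz ../R2.cleanmin50.fastq.gz | \
        samtools sort -@ 8 -o Cber1_vs_quinoa_rna.bam
    samtools index Cber1_vs_quinoa_rna.bam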

Good luck! I don't think you need to assemble the genome to perfection; find the orthologous loci first.