Assembling simulated reads

jakewendt commented 3 years ago

I am developing a script to simulate reads by taking a small (5mb) section of hg38 and randomly selecting regions in length from 1000bp to 15,000bp. Eventually I was going to add random error to these reads, however I wanted to try to reassemble these reads first for comparison. I had hoped that the result would be the initially selected sequence.

I selected 20,000 reads generating a 160mb nanopore.exact.fasta

If I run it simply with ...

shasta --input nanopore.exact.fasta

... it completes quite quickly but it discards many reads as being less than the minimum of 10,000bp. It creates an assembly file containing 39 contigs which is less than desirable.

I changed --Reads.minReadLength to 1000, but then everything seems to get dumped at or before the LowHash step.

2021-Jan-15 15:58:34.571230 LowHash0 iteration 0 begins.
Alignment candidates after iteration 0: high frequency 0, total 0, capacity 0.
...

I thought that I'd just create a better data set since I had that luxury and so I created a fasta file with 5,000 reads ranging from 10,000 to 100,000bp. Running shasta on this data set ...

shasta --input nanopore.large.exact.fasta

... also seems to ignore all reads at the LowHash step.

...
Read statistics for reads that will be used in this assembly:
    Total number of reads is 5000.
    Total number of raw bases is 273687311.
    Average read length is 54737.5 bases.
    N50 for read length is 70869 bases.
    The above statistics only include reads that will be used in this assembly.
    Read discarded because they contained invalid bases, were too short or contained repeat counts 256 or more are not counted.
2021-Jan-15 15:52:22.917920 Done loading reads from 1 files.
Read loading took 2.52523s.
Selected 104646 10-mers as markers out of 1048576 total.
Requested inclusion probability: 0.1.
Actual fraction of marker k-mers: 0.0997982.
The above statistics include all k-mers, not just those present in run-length encoded sequence.
2021-Jan-15 15:52:23.133373 Finding markers in 5000 reads.
2021-Jan-15 15:52:23.876131 Finding markers completed in 0.742721 s.
2021-Jan-15 15:52:23.876208 Finding palindromic reads.
2021-Jan-15 15:52:23.876377 0/5000
2021-Jan-15 15:52:40.046006 Flagged 2 reads as palindromic out of 5000 total.
Palindromic fraction is 0.0004
2021-Jan-15 15:52:40.049340 LowHash0 begins.
LowHash0 algorithm will use 2^24 = 16777216 buckets. 
2021-Jan-15 15:52:40.049383 Creating kmer ids for oriented reads.
There are 5000 reads, 10000 oriented reads.
2021-Jan-15 15:52:40.392807 LowHash0 iteration 0 begins.
Alignment candidates after iteration 0: high frequency 0, total 0, capacity 0.
2021-Jan-15 15:52:41.572112 LowHash0 iteration 1 begins.
Alignment candidates after iteration 1: high frequency 0, total 0, capacity 0.
2021-Jan-15 15:52:42.632666 LowHash0 iteration 2 begins.
Alignment candidates after iteration 2: high frequency 0, total 0, capacity 0.
2021-Jan-15 15:52:43.638796 LowHash0 iteration 3 begins.
Alignment candidates after iteration 3: high frequency 0, total 0, capacity 0.
2021-Jan-15 15:52:44.638895 LowHash0 iteration 4 begins.
Alignment candidates after iteration 4: high frequency 0, total 0, capacity 0.
2021-Jan-15 15:52:45.646940 LowHash0 iteration 5 begins.
Alignment candidates after iteration 5: high frequency 0, total 0, capacity 0.
2021-Jan-15 15:52:46.702136 LowHash0 iteration 6 begins.
Alignment candidates after iteration 6: high frequency 0, total 0, capacity 0.
2021-Jan-15 15:52:47.744836 LowHash0 iteration 7 begins.
Alignment candidates after iteration 7: high frequency 0, total 0, capacity 0.
2021-Jan-15 15:52:48.801683 LowHash0 iteration 8 begins.
Alignment candidates after iteration 8: high frequency 0, total 0, capacity 0.
2021-Jan-15 15:52:49.851700 LowHash0 iteration 9 begins.
Alignment candidates after iteration 9: high frequency 0, total 0, capacity 0.
2021-Jan-15 15:52:50.904854 Storing candidate alignments.
Found 0 alignment candidates.
...

Does anyone have any suggestions?

paoloczi commented 3 years ago

Shasta default assembly parameters are not generally optimal for any particular situation. Assembly parameters generally need to be tuned to the data being used, and for this reason we provide a few sample configuration files in directory shasta/conf. I suggest that you start with configuration file shasta/conf/Nanopore-Sep2020.conf and then optimize from there. Use command line option --config to specify the configuration file. You can download the configuration files from GitHub or get them from the tar file for a release.

If you try that and you still don't get a satisfactory assembly, please post the entire assembly log output plus file AssemblySummary.html from the assembly directory and I can help. It would also help if you can post your entire input file, perhaps in compressed form, if you don't mind doing that.

From the information you gave me so far I can't tell what is going on, but given that the MinHash step finds nothing it is possible that there may be a problem in the code that generates the simulated reads. But I will be able to better assess if you provide the information I mentioned above.

jakewendt commented 3 years ago

Testing now. Thank you.

paoloczi commented 3 years ago

I am closing this due to lack of discussion, but feel free to reopen it or create a new one if new discussion items emerge.

chanzuckerberg / shasta

Assembling simulated reads #226