Question about assembling long-read sequencing data with short-read polishing

huawen-poppy commented 9 months ago

Hello! Thanks for your helpful tool!

I am trying to assemble long-read sequencing data with short-read polishing. My long-read data (nanopore direct RNA sequencing) come from animals kept at 25 degrees. Besides the corresponding short reads at 25 degrees, I also have short reads from animals at 32 degrees. Should I use all the short reads for polishing?

Actually, I have tried both options, but I am having trouble understanding the output log file. Could you please tell me what G: |V| means, and also |E|? What are 'dovetail reads'? Thank you very much!

For your reference, below is the output for the 25-degree short reads only:

K-mer counting with ntCard...

Parsing histogram file `/output/rnabloom_k25.hist`...
Unique k-mers (k=25):     10,559,615,295
Unique k-mers (k=25,c>1): 1,950,047,372
K-mer counting completed in 0.318s

> Stage 2: Correct long reads for "rnabloom"
WARNING: Reads were already corrected!
> Stage 2 completed in 0.0s

> Stage 3: Assemble long reads for "rnabloom"
Overlapping sequences...
Parsed 1,218,002,332 overlap records in 4d 17h 52m 2s
total reads:    7,751,438
 - unique:      715,776 (9.23 %)
   - multi-seg: 47,949
Unique reads extracted in 1m 24s
Overlapping sequences...
Parsed 7,831,120 overlap records in 56m 2s
contained reads: 149,441
dovetail reads:  399,830
G: |V|=399,830 |E|=462,964
G: |V|=380,224 |E|=397,885
before: 768,079 after: 583,940
Laid out paths in 19.876s
Mapping sequences...
Mapping completed in 16h 58m 28s
Polishing sequences...
Polishing completed in 1d 0h 11m 0s
Overlapping sequences...
Parsed 6,713,069 overlap records in 42m 45s
contained reads: 178,727
dovetail reads:  306,120
G: |V|=306,120 |E|=359,686
Removing redundant vertexes...
G: |V|=287,034 |E|=312,892
Removing transitive edges...
G: |V|=287,034 |E|=305,263
Tallying read counts...
Counts tallied for 393813 sequences in 6m 26s
Pruning graph with read count information...
Supported edges: 167955
G: |V|=287,034 |E|=126,600
Extracting vertex sequences...
Sequences extracted in 13.72s
Extracting paths...
before: 583,940 after: 361,524
Laid out paths in 2.482s
> Stage 3 completed in 6d 12h 51m 17s
Total runtime: 6d 12h 51m 20s

Below is the output for the 25-degree and 32-degree short reads combined:

K-mer counting with ntCard...
Parsing histogram file `/output/rnabloom_k25.hist`...
Unique k-mers (k=25):     15,688,083,317
Unique k-mers (k=25,c>1): 3,388,854,165
K-mer counting completed in 0.606s

> Stage 2: Correct long reads for "rnabloom"
WARNING: Reads were already corrected!
> Stage 2 completed in 0.014s

> Stage 3: Assemble long reads for "rnabloom"
Overlapping sequences...
Parsed 1,221,031,559 overlap records in 4d 20h 39m 3s
total reads:    7,687,482
 - unique:      709,716 (9.23 %)
   - multi-seg: 48,363
Unique reads extracted in 1m 9s
Overlapping sequences...
Parsed 7,983,918 overlap records in 57m 34s
contained reads: 152,251
dovetail reads:  404,128
G: |V|=404,128 |E|=474,146
G: |V|=383,714 |E|=406,297
before: 762,545 after: 575,064
Laid out paths in 19.756s
Mapping sequences...
Mapping completed in 17h 23m 5s
Polishing sequences...
Polishing completed in 1d 2h 37m 49s
Overlapping sequences...
Parsed 6,527,316 overlap records in 42m 34s
contained reads: 172,477
dovetail reads:  307,258
G: |V|=307,258 |E|=365,568
Removing redundant vertexes...
G: |V|=287,490 |E|=317,110
Removing transitive edges...
G: |V|=287,490 |E|=308,649
Tallying read counts...
Counts tallied for 390775 sequences in 6m 40s
Pruning graph with read count information...
Supported edges: 169926
G: |V|=287,490 |E|=124,109
Extracting vertex sequences...
Sequences extracted in 11.47s
Extracting paths...
before: 575,064 after: 358,714
Laid out paths in 1.991s
> Stage 3 completed in 6d 18h 30m 56s
Total runtime: 6d 18h 30m 59s

kmnip commented 9 months ago

I see that you started with about 7.7 million reads in Stage 3. You must have a lot of long reads as input!

> Besides the corresponding short reads at 25 degrees, I also have short reads from animals at 32 degrees. Should I use all the short reads for polishing?

Yes, you should provide all short reads for polishing.

> Actually, I have tried both options, but I am having trouble understanding the output log file. Could you please tell me what G: |V| means, and also |E|? What are 'dovetail reads'? Thank you very much!

|V| and |E| are the numbers of vertices and edges in the graph, respectively.
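To follow how the graph shrinks from step to step, the "G:" lines can be pulled out of a saved log. The snippet below is only a minimal sketch in Python: the function name `parse_graph_sizes` and the regular expression are made up for illustration and are not part of RNA-Bloom, and the line format is assumed to match the log pasted above.

```python
import re

# Matches lines such as "G: |V|=399,830 |E|=462,964" (format assumed from the log above).
GRAPH_LINE = re.compile(r"^G: \|V\|=([\d,]+) \|E\|=([\d,]+)$")


def parse_graph_sizes(log_text):
    """Return one (vertices, edges) tuple per "G:" line, in log order.

    Hypothetical helper for illustration; not part of RNA-Bloom.
    """
    sizes = []
    for line in log_text.splitlines():
        match = GRAPH_LINE.match(line.strip())
        if match:
            # Strip the thousands separators before converting to int.
            vertices, edges = (int(g.replace(",", "")) for g in match.groups())
            sizes.append((vertices, edges))
    return sizes


# Three lines copied from the 25-degree log above.
sample = """dovetail reads:  399,830
G: |V|=399,830 |E|=462,964
G: |V|=380,224 |E|=397,885
"""

for vertices, edges in parse_graph_sizes(sample):
    print(f"vertices={vertices:,} edges={edges:,} edges/vertex={edges / vertices:.2f}")
```

In the pasted log, the first |V| after each overlap step equals the "dovetail reads" count on the line before it.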

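"Dovetail reads" are those that have dovetail overlaps, where the end of one read overlaps the start of another and neither read is contained in the other, e.g.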

Read 1:  ============
Overlap:       ||||||
Read 2:        ==============
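The difference between "contained" and "dovetail" can also be written as a small check on overlap coordinates. This is a simplified sketch, not RNA-Bloom's actual logic: the function `classify_overlap` is hypothetical, both reads are assumed to be on the same strand, and the overlap is assumed to reach the read ends exactly (real overlappers usually tolerate a few unaligned bases at the ends).

```python
def classify_overlap(len_a, start_a, end_a, len_b, start_b, end_b):
    """Classify a same-strand overlap between reads A and B.

    Coordinates are 0-based and end-exclusive: the overlap covers
    A[start_a:end_a] and B[start_b:end_b].
    Hypothetical helper for illustration; not part of RNA-Bloom.
    """
    # The overlap spans all of one read: that read is contained in the other.
    if start_a == 0 and end_a == len_a:
        return "A contained in B"
    if start_b == 0 and end_b == len_b:
        return "B contained in A"
    # The suffix of one read matches the prefix of the other: dovetail.
    if end_a == len_a and start_b == 0:
        return "dovetail (A then B)"
    if end_b == len_b and start_a == 0:
        return "dovetail (B then A)"
    # The overlap stops short of the read ends on at least one side.
    return "internal"


# The diagram above: Read 1 has 12 bases, Read 2 has 14, and the last
# 6 bases of Read 1 overlap the first 6 bases of Read 2.
print(classify_overlap(12, 6, 12, 14, 0, 6))

# A short read lying entirely inside a longer one.
print(classify_overlap(5, 0, 5, 20, 3, 8))
```

Under these assumptions, the reads drawn in the diagram are classified as a dovetail pair, while a read that lies entirely inside another is classified as contained.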

huawen-poppy commented 9 months ago

Thank you for your explanation! I have a further question: in the assembled file, how can I tell which transcripts are isoforms of one another?

kmnip commented 9 months ago

RNA-Bloom does not report that information.

bcgsc / RNA-Bloom

Question about assembling long-read sequencing data with short-read polishing #68