aljpetri / isONform

De novo construction of isoforms from long-read data
GNU General Public License v3.0

majority clusters after correction was not included in isonform output #18

Open alexyfyf opened 3 months ago

alexyfyf commented 3 months ago

Hi @aljpetri ,

I'm trying to run your tool on simulated ONT data with about 6 million reads. I used the new pipeline script, and it ran without error. But when I checked the results, I found 29583 clusters after the clustering step and 11498 after correction (I've checked, and I think the smaller number is because low-abundance clusters with fewer than 5 reads are removed). The number of reads after correction is about 3.9 million. However, the final isONform results contain only 2700 transcripts, and the total number of reads supporting these 2700 transcripts is only about 46000. I don't think that is normal. Do you have any idea where to look for the reason? Any chance there is a bug? I can see that the isONform output folder contains the same number of clusterxxx_merged.fa files as there are clusters in the correction output, but some of the files are empty.
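A quick way to sanity-check per-cluster outputs like those described above is to count FASTA records and list empty files. This is a generic shell sketch on made-up demo files (the `demo_out` directory and its contents are hypothetical; substitute the real isONform output folder):

```shell
# Build a tiny demo layout mimicking the described isONform output (hypothetical data)
mkdir -p demo_out
printf '>tx1\nACGT\n>tx2\nGGCC\n' > demo_out/cluster0_merged.fa
: > demo_out/cluster1_merged.fa   # an empty per-cluster file, as reported in the issue

# Total transcripts across all merged cluster files (each FASTA record starts with '>')
cat demo_out/cluster*_merged.fa | grep -c '^>'

# Per-cluster files that contain no transcripts at all
find demo_out -name 'cluster*_merged.fa' -empty
```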

Thank you so much. Alex

aljpetri commented 3 months ago

Dear Alex,

Thank you very much for reporting the error. The results do look unexpected. Could you give me some more information about the data that you ran the isONpipeline on?

- How did you simulate the data? Is the simulation tool capable of generating end-to-end (full-length) transcripts?
- How many transcripts (and possibly clusters) would you expect to be in the data set?
- Which parameters did you run the isONpipeline with?
- It is unexpected from our side that the correction step outputs fewer clusters than the clustering step did. Were there any errors when running the data?
- For isONform it may actually happen that empty clusterxxx_merged.fa files are produced: if no transcript in a cluster reaches the necessary support, no transcript will be called from that cluster.

Overall it seems that the simulated reads are not full length, since so few transcripts are reported. Possibly only the shortest transcripts get reported.

Best, Alex

alexyfyf commented 3 months ago

> Could you give me some more information about the data that you ran the isONpipeline on?

> How did you simulate the data? Is the simulation tool capable of generating end-to-end (full-length) transcripts?

I used NanoSim to simulate cDNA reads from the GENCODE reference fasta. I think it simulates unstranded data; I'm not sure if the reads are end to end.

> How many transcripts (and possibly clusters) would you expect to be in the data set?

I simulated ~45000 transcripts from ~12000 genes.

> Which parameters did you run the isONpipeline with?

I used the ont_no_pychopper mode.

> It is unexpected from our side that the correction step outputs fewer clusters than the clustering step did. Were there any errors when running the data?

No errors that I can see in the log. The difference comes from the small clusters with < 5 reads, which are not reported in the correction results. For example, from cluster 11497 onward the read count per cluster is <= 4 (cluster 11496 has 5 reads and cluster 11497 has 4, and I assume the clusters are ordered from largest to smallest). In the correction folder, the folders from 11497 onward do not exist at all.

> For isONform it may actually happen that empty clusterxxx_merged.fa files are produced, as if we do not get any transcripts having the necessary support no transcript will be called from the cluster.

The simulated reads have a mean length of 1200 bp, and I can get decent results from RNAbloom2 (mean assembled transcript length ~1500 bp); I was just interested in trying another tool to compare. So I don't think that's likely the case.

Thank you so much for your reply. Do you think it might be helpful if I send you the log?

Cheers, Alex

ksahlin commented 3 months ago

Looks like trans-nanosim and RNAbloom2 (same author group) use 11% error rates (typical for ONT dRNA). cDNA ONT error rates have in recent years typically been around 6-7%. (The isONpipeline is designed for, and uses parameters suited to, cDNA at around 7%.)

If reads have 11% errors on average, isONcorrect may deliver a post-correction error rate around 2-4% (see the figure from the isONcorrect paper below, panel B). It is possible that this affects isONform's ability to trace out isoforms with the default parameters.

This is just a guess, applicable only if your reads have 10-11% errors. I guess the only way to verify that is a simulation with typical cDNA error rates.

[Screenshot 2024-04-04: figure from the isONcorrect paper, panel B]

alexyfyf commented 3 months ago

> Looks like the trans-nanosim and RNAbloom2 (same author-group) are using 11% error rates (typical ONT dRNA). cDNA ONT error rate have in the last years typically been around 6-7%. (isONpipeline is designed/uses parameters suitable for cDNA around 7%).
>
> If reads have 11% errors on average, it may be that isONcorrect delivers a post error correction rate around 2-4% (see fig from isONcorrect paper below, panel B). It is possible that this affects isONform's ability to trace out isoforms with the default parameters.
>
> Just a guess applicable only in case you have reads with 10-11% errors. I guess the only way to verify that is a simulation with typical cDNA error rates.

Thank you, Kristoffer, for your insights. Just wondering, is there any parameter I can change to make isONform work with this high error rate? I will also run it on some recent real data soon.

aljpetri commented 3 months ago

It looks like isONclust is doing a decent job, judging by the number of genes (12000) versus clusters (~11500). A source of error could be isONcorrect, due to its new default minimizer parameters. Those are quite a lot faster but may not cope all too well with the high error rate of the data. We would advise running the isONcorrect algorithm with the old parameters (--k 9 --w 10 --max_seqs 1000), which should cope better with high error rates than the new parameters. Maybe this alone would yield a better correction and improve the input for isONform. If this does not improve the results significantly, you could increase the delta_len parameter for isONform to enable more reads to be merged and thereby get higher support per transcript.
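For reference, a rerun with the old parameters might look like the following sketch. Only the `--k`, `--w`, and `--max_seqs` flags come from the advice above; the `run_isoncorrect` wrapper name, the folder flags, and the paths are assumptions here and should be checked against your installed version's `--help`:

```shell
# Sketch only: rerun error correction with the older, slower minimizer parameters.
# Folder names are placeholders for the clustering output and a fresh correction folder.
run_isoncorrect --t 8 \
    --fastq_folder clustering_out/fastq_files \
    --outfolder correction_out_k9w10 \
    --k 9 --w 10 --max_seqs 1000
```

Writing to a fresh output folder keeps the original correction results available for comparison.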

alexyfyf commented 3 months ago

Thank you @aljpetri. Let me clarify a bit: the output of the clustering step showed 29582 clusters, of which 11496 clusters with ≥ 5 reads are corrected in the following step. Do I only need to change the isONcorrect parameters and leave isONclust as is?

ksahlin commented 3 months ago

> Let me clarify a bit: the output of the clustering step showed 29582 clusters, of which 11496 clusters with ≥ 5 reads are corrected in the following step.

Yes, some low-abundance clusters (e.g. singleton reads) are typically expected due to the higher error rate. Still, it seems reassuring that 11,496 clusters made the cutoff, given the ~12,000 genes in the simulation: while that doesn't prove anything about the clustering quality, it's reassuring that those numbers are in the same ballpark.

> Do I only need to change the isONcorrect parameters and leave isONclust as is?

I think this is a good first fix. We are guessing the decay is due to the default parameters for isONcorrect, which are tuned for 6-7% error rates. Only as a second attempt would I try changing isONform parameters. Note that isONcorrect will take significantly longer with --k 9 --w 10 --max_seqs 1000.