Magdoll / cDNA_Cupcake

Miscellaneous collection of Python and R scripts for processing Iso-Seq data
BSD 3-Clause Clear License

chain_samples counts error? #162

Open christine-liu opened 3 years ago

christine-liu commented 3 years ago

Hi Liz,

Apologies if this is a redundant question - I think #7 might explain the results I'm getting too, but wanted to double check. I'm chaining together some Iso-Seq samples, and the counts seem to be really off. The abundance files that list count_fl and norm_fl give a total number of FL reads that is significantly less than the number of FL reads in the chained_count.txt file. For example, one sample's abundance file says there are 321152 FL reads, while the chained_count.txt file says that the same sample has 21658655 reads. This number also seems really odd to me b/c this sample was run on a single SMRT cell (combined with other samples), so 21M reads is way more than should be expected. The reason I'm not 100% sure that #7 explains this is b/c the abundance files now only contain count_fl and norm_fl, so there wouldn't be any input to chain_samples.py that provides the count_nfl, right?

Thanks, Christine

Magdoll commented 3 years ago

Hi @christine-liu, Yeah, so the total number of FL reads in the original (pre-chaining) files includes FL reads that do not make it into the final collapsed isoforms (e.g., you could have 1000 FL reads, but after isoseq clustering, mapping, filtering by alignment quality, etc., only 500 FL reads make it to the .abundance.txt file).

Hope this explains it. -Liz

christine-liu commented 3 years ago

Hi Liz,

Thanks for your response, but I'm not sure it really answers my question (or I'm totally misunderstanding something). I ran each sample through ccs, lima, isoseq3 refine/cluster, minimap2, collapse_isoforms_by_sam, and then generated counts files using get_abundance_post_collapse. One of these counts files (which is after the clustering, mapping, and filtering by alignment quality) says that it has 321152 FL reads (and all the counts add up within the abundance file). When I chain that sample with all the other samples, the chained_count.txt file now says that the same sample has 21658655 reads (far more than a single SMRT cell). Shouldn't the chained file be consistent with the abundance files that are listed in the config file? And even if that's not exactly how the counts work out, isn't 21M unreasonably large?

Thanks, Christine
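
For anyone wanting to reproduce the comparison described above, here is a minimal sketch that sums the per-isoform counts in one sample's abundance file and the per-sample columns in the chained count file. It is not part of cDNA_Cupcake. It assumes both files are tab-delimited, that the abundance file starts with `#` comment lines followed by a header containing `count_fl`, and that the chained count file has an ID column named `superPBID` with `NA` for isoforms missing from a sample; the function names are hypothetical, so adjust them and the column names to match your files.

```python
import csv
import io
from collections import defaultdict


def total_fl_from_abundance(handle):
    """Sum the count_fl column of an abundance file, skipping '#' comment lines."""
    rows = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(rows, delimiter="\t")
    return sum(int(row["count_fl"]) for row in reader)


def totals_from_chained_counts(handle, id_column="superPBID"):
    """Sum every per-sample column; 'NA' and blank cells count as zero.

    The id_column name and the 'NA' placeholder are assumptions about the file layout.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    totals = defaultdict(int)
    for row in reader:
        for name, value in row.items():
            if name == id_column or value in ("NA", "", None):
                continue
            totals[name] += int(value)
    return dict(totals)


if __name__ == "__main__":
    # Toy data standing in for real files; replace with open("path/to/file").
    abundance = io.StringIO(
        "# comment line\npbid\tcount_fl\tnorm_fl\nPB.1.1\t3\t0.6\nPB.2.1\t2\t0.4\n"
    )
    chained = io.StringIO(
        "superPBID\tsampleA\tsampleB\nPB.1.1\t3\tNA\nPB.2.1\t2\t5\n"
    )
    print(total_fl_from_abundance(abundance))
    print(totals_from_chained_counts(chained))
```

If a sample's column total in the chained file is much larger than the `count_fl` total in its abundance file, the per-isoform rows can then be compared one by one to see where the extra counts come from.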

Dongxu-Zheng commented 2 years ago

Hi Christine,

Have you ever figured it out? I have the same problem with my chained dataset. Thanks for any reply in advance.

Cheers, Dongxu

christine-liu commented 2 years ago

Hi @Dongxu-Zheng,

Sorry! I still haven't figured this out either. I haven't had the time to look carefully through Liz's chaining code to figure out how the numbers are getting wacky. I'll let you know if I figure it out :)

-Christine