Magdoll / Cogent

Coding Genome Reconstruction using Iso-Seq data
BSD 3-Clause Clear License

Failed reconstructions with highly repetitive gene family #86

Closed: mollydawson closed this issue 3 years ago

mollydawson commented 3 years ago

Hi Liz,

I'm working through reconstruct_contig.py as part of Cogent v6.1.0 to eventually collapse my Iso-Seq transcripts. I started with just under 15k transcripts as input, which generated 1,811 partitions, and I was able to reconstruct contigs for them with the default k-mer size of 30. However, I'm stuck on the reconstruction for my largest gene family, which contains 3,450 highly repetitive transcripts.

Following the example code in the "troubleshooting failed reconstructions" section, I re-ran the reconstruction with increasing k-mer sizes, in steps of 5 up to 1000 (I was unsure of the maximum k-mer size), together with the --nx_cycle_detection parameter, and all reconstructions still failed.
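For readers who want to reproduce a sweep like the one described above, here is a minimal sketch that only builds the command lines, one per k-mer size, without running them. It assumes the invocation form that appears later in this thread (reconstruct_contig.py, a partition directory, -k) plus the --nx_cycle_detection flag mentioned above. The helper name build_sweep_commands, the directory name my_partition_dir, and the upper bound of 100 are hypothetical and chosen for illustration.

```python
import shlex


def build_sweep_commands(partition_dir, k_values, nx_cycle_detection=True):
    """Return one reconstruct_contig.py command (as an argv list) per k-mer size.

    Only flags that appear in this thread are used: -k and --nx_cycle_detection.
    """
    commands = []
    for k in k_values:
        cmd = ["reconstruct_contig.py", partition_dir, "-k", str(k)]
        if nx_cycle_detection:
            cmd.append("--nx_cycle_detection")
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    # Hypothetical partition directory; k from 30 to 100 in steps of 5.
    for cmd in build_sweep_commands("my_partition_dir", range(30, 105, 5)):
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to inspect them, or to paste them into a job script, before committing to a long run on a large family.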

Is it still possible to troubleshoot this reconstruction? If there is no fix, I was thinking I could exclude this partition from the rest of the pipeline, collapse the transcriptome without those sequences, and then explore clustering them with CD-HIT.

Please let me know if you can provide me with any insight, thanks for your time!

Magdoll commented 3 years ago

Hi @mollydawson, a 3,450-transcript family that is highly repetitive... is a big challenge! Would you mind sending me just this family? I can give it a shot. If so, please give me an email address so I can send a file upload request 👍

-Liz

mollydawson commented 3 years ago

Hi Liz,

Thank you so much! molly_dawson@student.uml.edu

Magdoll commented 3 years ago

Hi @mollydawson - dropoff request sent.

Magdoll commented 3 years ago

Hi @mollydawson - thank you, I got the file. Will work on it. Don't hesitate to use this issue to ping me if you don't hear back, in case this slips my mind.

Magdoll commented 3 years ago

Hi @mollydawson, I ran the reconstruction with the command reconstruct_contig.py . --max_split_in_size 40 -k 300 and was able to compress the results from ~3400 transcripts down to ~1600, which... isn't super. You can keep trying a few k-mer sizes, but at some point I'd probably leave it at that, since this family is so highly repetitive.

-Liz
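To compare results across k-mer sizes, one option is to count FASTA records before and after reconstruction and look at the fraction retained. The sketch below is not part of Cogent. The helper names are hypothetical, and the numbers in the demo are the approximate counts quoted in this thread.

```python
def count_fasta_records(lines):
    """Count FASTA records by counting header lines (those starting with '>')."""
    return sum(1 for line in lines if line.startswith(">"))


def fraction_retained(n_before, n_after):
    """Fraction of input sequences that remain after reconstruction."""
    if n_before <= 0:
        raise ValueError("n_before must be positive")
    return n_after / n_before


if __name__ == "__main__":
    # Approximate counts from this thread: ~3400 transcripts in, ~1600 out.
    print(f"{fraction_retained(3400, 1600):.0%} of sequences retained")
```

To count records in a real file, pass an open file handle, for example count_fasta_records(open(path)), where path points to whichever input or output FASTA you want to measure.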

mollydawson commented 3 years ago

Hi Liz,

Thank you so much for your help/time with this. Much appreciated.

Best,

Molly

On Oct 6, 2020, at 4:15 PM, Elizabeth Tseng notifications@github.com wrote:



Closed #86.
