Too many nodes and edges in family finding

fantastycrane commented 6 years ago

Hi @Magdoll , One of the family finding commands from the generated batch commands in my project has not finished after a rather long time (roughly two days). It is stuck at the process_kmer_to_graph phase and has used about 100 GB of memory. The output shows that the graph for this cluster contains 49018 nodes and 79611730 edges, while the corresponding numbers for other clusters are much smaller. Is this normal?
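
For a sense of scale, the node and edge counts reported above can be turned into a density figure. This is only a back-of-the-envelope sketch, assuming the graph is undirected with no self-loops or duplicate edges; `graph_stats` is a made-up helper name, not part of Cogent. With these numbers, about 6.6% of all possible sequence pairs are linked, and each sequence has roughly 3,250 neighbours on average.

```python
def graph_stats(nodes: int, edges: int) -> dict:
    """Summarise how dense an undirected graph is (hypothetical helper)."""
    max_edges = nodes * (nodes - 1) // 2   # number of distinct node pairs
    return {
        "max_edges": max_edges,
        "density": edges / max_edges,       # fraction of possible pairs that are linked
        "mean_degree": 2 * edges / nodes,   # average number of neighbours per node
    }


# Counts taken from the process_kmer_to_graph output quoted in this comment.
stats = graph_stats(49018, 79611730)
print(stats)
```

The figures by themselves do not say whether the cause is a genuinely abundant gene or spurious similarity hits; they only show how far this cluster is from the small, sparse graphs seen for the others.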

Magdoll commented 6 years ago

Hi @fantastycrane , Did you run run_preCluster.py? It is possible this is an unusually abundant gene, or minimap2 is too happy to pick up low similarity hits.

If the issue persists, I will come up with a way to resolve it.

--Liz

fantastycrane commented 6 years ago

Hi @Magdoll ,

I did run run_preCluster.py, and preCluster.output.csv shows that 49643 sequences (67505 in total) were clustered into the same cluster. This number dropped to 27155 after I upgraded Cupcake. Still, it is too large for process_kmer_to_graph.
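
A quick way to spot oversized clusters before launching family finding is to tally members per cluster from the preCluster CSV. The sketch below is illustrative only: the column names `seqid` and `cluster`, the helper names, and the `max_members` threshold are all assumptions, so check the header of your own preCluster.output.csv and adjust.

```python
import csv
import io
from collections import Counter


def cluster_sizes(handle, cluster_column="cluster"):
    """Count sequences per cluster. The column name is an assumption."""
    reader = csv.DictReader(handle)
    return Counter(row[cluster_column] for row in reader)


def oversized(sizes, max_members=5000):
    """Return clusters above an arbitrary, hypothetical size threshold."""
    return {cluster: n for cluster, n in sizes.items() if n > max_members}


# Tiny made-up table standing in for the real CSV file.
demo = io.StringIO("seqid,cluster\nseqA,1\nseqB,1\nseqC,2\n")
sizes = cluster_sizes(demo)
print(sizes.most_common(1))
```

With a real file, pass `open("preCluster.output.csv", newline="")` instead of the in-memory demo table.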

The same issue occurred in several different projects. It is worth noting that HQ sequences combined with corrected LQ sequences were used when running Cogent. The corrected LQ sequences had been filtered to accuracy above 0.99.

Results were relatively normal when only HQ sequences from IsoSeq were used.

Magdoll commented 6 years ago

Hi @fantastycrane ,

I'd like to take a look at it. I've requested a file upload to your email.

--Liz

fantastycrane commented 6 years ago

Hi @Magdoll ,

The requested fasta file has been uploaded. Sequences whose headers carry full-length coverage and other information are HQ isoforms, while sequences with only a cluster id are corrected LQ isoforms filtered to accuracy above 0.99.

Magdoll commented 6 years ago

Received and will work on it.

fantastycrane commented 6 years ago

Any progress on this issue?

Magdoll commented 6 years ago

Working on it. Responding soon.

Magdoll commented 6 years ago

Hi @fantastycrane ,

I'm still exploring options for this problem. But I want to say I don't feel like including LQ in Cogent is necessarily a good idea. I know the LQ has been error corrected and filtered for 99% accuracy, but looking at the sequences themselves I'm still worried it's introducing more noise.

Another reason this problem generally would not occur with regular Iso-Seq HQ output is that the redundancy you are seeing in LQ (lots of LQ sequences ending up in the same gene family) would not happen, because the Iso-Seq cluster step is supposed to reduce redundancy as much as possible.

I could try to get the graph cutting to work for larger graphs, but that may take some time.

One possibility is to cut that one big cluster (that caused the graph cut to crash) down to a smaller seed cluster (maybe using only HQ sequences), run Cogent, then align the LQ sequences in that cluster to remove anything that doesn't give more information.
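
The first step of that workaround, pulling an HQ-only seed set out of the big cluster, could look roughly like the sketch below. It is not part of Cogent. It assumes HQ headers can be recognised by the text `full_length_coverage=` (as suggested by the description of the uploaded file earlier in this thread), and both helper names are invented for illustration.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_hq_lq(records, hq_marker="full_length_coverage="):
    """Separate records by header content. The marker is an assumption."""
    hq, lq = [], []
    for header, seq in records:
        (hq if hq_marker in header else lq).append((header, seq))
    return hq, lq


# Made-up records: one HQ-style header, one header with only a cluster id.
demo = [
    ">c1/f2p0/1200 isoform=c1;full_length_coverage=2\n",
    "ACGTACGT\n",
    ">c2\n",
    "ACGTTTGT\n",
]
hq, lq = split_hq_lq(read_fasta(demo))
print(len(hq), "HQ seed sequences;", len(lq), "LQ sequences held back")
```

The HQ list would then be written out and given to Cogent as the seed cluster. Aligning the held-back LQ sequences afterwards and discarding the uninformative ones is a separate step that is not shown here.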

I'll think about ways to automate this process.

jayceejiao commented 6 years ago

Hi @Magdoll,

I ran into the same problem today. I have a family containing 43 K HQ and 141 K LQ sequences (corrected by Illumina reads), representing 65% of the total input sequences. I am not sure why this happened or how I can fix it. Can I just select the HQ sequences, use cd-hit to reduce the redundancy, and then use the resulting sequences as the representative sequences for this family? Do you have a better suggestion? Thanks a lot.

Chen

fantastycrane commented 6 years ago

Hi @Magdoll ,

I have learned previously that LQ sequences corrected by Illumina reads were not totally trustworthy. LQ was only included to gain more isoforms without requiring too many sequencing cells. Hope that makes sense.

It would be much appreciated if your solution could work.

huajiachicat commented 6 years ago

Hello @Magdoll ,

I have the same issue when running family finding on a large dataset. After preCluster, I found that one bin has 62124 contigs in its isoseq_flnc.fasta. When I run the family finding command for this bin, it fails with an out-of-memory error:

    1.1 Expected output isoseq_flnc.fasta.s1000k30.dist already exists. Just use it.
    WARNING: Output directory /storage/hpc/data/zzr25/pacbio_post_analysis/unalign_cogent/unaligned_total already exists
    making weight graph from /storage/hpc/data/zzr25/pacbio_post_analysis/unalign_cogent/preCluster_out/5818/isoseq_flnc.fasta.s1000k30.dist
    graph contains 58165 nodes, 70780629 edges
    performing ncut on graph....
    slurmstepd: error: Exceeded job memory limit at some point.
    srun: error: lewis4-r640-hpc5-node845: task 0: Out Of Memory
    slurmstepd: error: Exceeded job memory limit at some point.

I am wondering what the problem is and would appreciate any solutions.

AxelMacFoly commented 5 years ago

Hello @Magdoll,

The same happens with my dataset, but I have HQ isoforms ONLY, coming from the isoseq3 pipeline (included in smrtlink v6). The whole dataset contains 670,000 HQ isoforms (because we generated 54 RSII and 15 Sequel SMRT cells for that project), and 210,000 of them ended up in the largest bin. I aborted process_kmer_to_graph for that bin after it had run for more than two weeks and was using more than 350 GB of RAM.

Which parameters in which scripts does it make sense to adjust? Does it make sense to increase the value of the -f option in the hardcoded minimap2 call in your Python script (AlignerRunners.py)? I ran it several times with different -f values (0.00001, 0.0001, 0.001 and 0.02). At 0.02 it generates two bins with 50,000 and 46,000 isoforms. The number of orphans increases a little, as does the total number of bins. The number of chimeras stays the same.

Any help is appreciated.

Magdoll / Cogent

Too many nodes and edges in family finding #41