dyxstat / ViralCC

ViralCC: leveraging metagenomic proximity-ligation to retrieve complete viral genomes
GNU Affero General Public License v3.0

Difficulty running 'ViralCC': Stuck at "Integrative graph construction finished and there are 60.0 edges in the integrative graph" #4

Closed: xzzhouxi closed this issue 7 months ago

xzzhouxi commented 9 months ago

Hello, I'm encountering an issue while running the 'ViralCC' program. Specifically, the program seems to get stuck at a certain point in the process with the message "Integrative graph construction finished and there are 60.0 edges in the integrative graph." The memory usage becomes unusually high, and the program doesn’t seem to progress further, causing it to run indefinitely.

Upon further investigation, I've noticed that the issue seems to be specific to one particular sample in my dataset. Other samples run successfully without any problems. I've checked the data format and content of this specific sample, and it appears to be similar to the others.

This suggests that the problem might be related to the data or format of this particular sample causing the program to hang at the "Integrative graph construction" stage.

Logs:

DEBUG | 2024-01-08 12:20:49,550 | main | ViralCC v1.0.0, released at 03/2022
DEBUG | 2024-01-08 12:20:49,551 | main | 3.7.10 | packaged by conda-forge | (default, Feb 19 2021, 16:07:37) [GCC 9.3.0]
DEBUG | 2024-01-08 12:20:49,551 | main | Command line: ./viralcc.py pipeline -v ../04_megahit_out/L351/best.contigs.fa ../04_megahit_out/L351/MAP_SORTED.bam ../07_checkv/L351/checkv_combined_name_for_viralcc.txt Test/L351
INFO | 2024-01-08 12:20:49,551 | construct_graph | Reading fasta file...
DEBUG | 2024-01-08 12:20:52,638 | construct_graph | There are totally 5233 contigs in reference fasta
INFO | 2024-01-08 12:20:52,737 | construct_graph | Filtering contigs by minimal length(1000)...
DEBUG | 2024-01-08 12:20:52,746 | construct_graph | 0 contigs miss and 0 contigs are too short
DEBUG | 2024-01-08 12:20:52,747 | construct_graph | Accepted 5233 contigs covering 103139214 bp
INFO | 2024-01-08 12:20:52,747 | construct_graph | Counting reads in bam file...
DEBUG | 2024-01-08 12:21:23,267 | construct_graph | BAM file contains 23952097 alignments
INFO | 2024-01-08 12:21:23,269 | construct_graph | Handling the alignments...
DEBUG | 2024-01-08 12:22:32,362 | construct_graph | Pair accounting: OrderedDict([('accepted pairs', 40202), ('map_same_contig pairs', 8965553), ('ref_excluded pairs', 0), ('poor_match pairs', 347396), ('single read', 5245795)])
INFO | 2024-01-08 12:22:32,541 | construct_graph | There are 240 viral contigs
INFO | 2024-01-08 12:22:32,541 | construct_graph | There are 4993 potential host contigs
INFO | 2024-01-08 12:22:32,541 | construct_graph | Write information of viral contigs and potential host contigs
INFO | 2024-01-08 12:22:33,112 | construct_graph | the threshold of shared host contig is 4
INFO | 2024-01-08 12:22:33,112 | construct_graph | there are 0 edges in the host proximity graph
INFO | 2024-01-08 12:22:33,112 | construct_graph | there are 60.0 edges in the Hi-C interaction graph
INFO | 2024-01-08 12:22:33,113 | construct_graph | Integrate the Hi-C interaction graph and the host proximity graph
INFO | 2024-01-08 12:22:33,113 | construct_graph | Integrative graph construction finished and there are 60.0 edges in the integrative graph
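To put the "Pair accounting" entry in the log above into perspective, the counts can be pulled out and shown as shares. This is only a rough sketch: the helper name `parse_pair_accounting` is made up for illustration and is not part of ViralCC, and the categories may not be mutually exclusive, so the percentages are indicative at best.

```python
import ast
import re

# Pair-accounting entry copied verbatim from the log in this report.
LOG_LINE = (
    "Pair accounting: OrderedDict([('accepted pairs', 40202), "
    "('map_same_contig pairs', 8965553), ('ref_excluded pairs', 0), "
    "('poor_match pairs', 347396), ('single read', 5245795)])"
)


def parse_pair_accounting(line):
    """Return the counts of a 'Pair accounting: OrderedDict([...])' entry as a dict.

    Hypothetical helper for illustration; it only parses the printed text.
    """
    match = re.search(r"OrderedDict\((\[.*\])\)", line)
    if match is None:
        raise ValueError("no pair-accounting data found in line")
    # The captured text is a plain list of (name, count) tuples.
    return dict(ast.literal_eval(match.group(1)))


counts = parse_pair_accounting(LOG_LINE)
total = sum(counts.values())
for name, value in counts.items():
    print(f"{name:>22}: {value:>10} ({value / total:.2%})")
```

In this run only a small share of the listed counts falls under "accepted pairs", which fits the sparse Hi-C interaction graph reported a few lines later.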

I expected the program to proceed beyond the "Integrative graph construction" step and complete the process within a reasonable timeframe.

I appreciate any assistance or guidance you can provide to resolve this issue. Thank you!
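One general way to find out where a Python process that appears to hang is actually spending its time is the standard-library `faulthandler` module, which can dump the stack of every thread at a fixed interval. The sketch below assumes it would be called near the start of the script being debugged; the function names and the 600-second interval are illustrative choices, not anything ViralCC provides.

```python
import faulthandler
import sys


def install_hang_diagnostics(interval_seconds=600, stream=None):
    """Dump all thread stacks on fatal errors and every `interval_seconds`.

    Hypothetical helper: `stream` must be a file object with a real file
    descriptor; it defaults to standard error.
    """
    if stream is None:
        stream = sys.stderr
    # Report stacks if the interpreter crashes (e.g. on a segmentation fault).
    faulthandler.enable(file=stream, all_threads=True)
    # Report stacks periodically so a stalled step shows up in the output.
    faulthandler.dump_traceback_later(interval_seconds, repeat=True, file=stream)


def remove_hang_diagnostics():
    """Stop the periodic dumps and disable the fault handler again."""
    faulthandler.cancel_dump_traceback_later()
    faulthandler.disable()
```

Calling `install_hang_diagnostics()` once before the pipeline starts would then write a traceback every ten minutes; if the same frames keep appearing after the last log message, that points at the step where the run is stuck.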

dyxstat commented 9 months ago

Hi,

Thanks for using our software. This problem is very strange. There are only a few viral contigs and the Hi-C graph is very sparse, so I would expect the clustering step to finish quickly.

Were any other output files produced besides the log file?

Best

xzzhouxi commented 9 months ago

Hi,

Thank you for your prompt response and for looking into this issue.

Regarding additional output files, I've checked the directory and found a couple of relevant files:

prokaryotic_contig_info.csv:
k141_2581,10438,71.948649166507
k141_2669,12820,37.30109204368175
k141_3739,15647,38.79976992394708
k141_3757,21628,61.27242463473275
k141_3796,19861,62.51447560545793
k141_3938,11722,40.496502303361204
k141_4277,11054,36.05934503347205
k141_5292,10049,55.199522340531395
k141_7693,11778,60.12905416878927
k141_7875,10111,39.72900801107704
...
This file contains information about prokaryotic contigs, including their IDs, lengths, and some numerical values.

viral_contig_info.csv:
k141_10205,30530,42.99705207992139
k141_11078,16884,33.001658374792704
k141_11089,12837,32.26610578795669
k141_14599,14516,62.34499862220998
k141_16287,14185,34.65632710609799
k141_29388,14953,34.06674245970708
k141_34658,10804,37.69900037023325
k141_41759,30731,32.62503660798542
k141_47832,10062,63.804412641621944
k141_57227,20841,39.78216016505926
...
This file contains information about viral contigs, including their IDs, lengths, and numerical values.

The issue seems to arise during the "Integrative graph construction" step despite having sparse viral contigs and a straightforward Hi-C graph. I've rechecked the data formats, and they appear consistent with the other successful runs.
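A format check of this kind can be scripted so that the problematic sample and a working one are compared on the same terms. The sketch below assumes the layout seen in the excerpts above (no header row; contig ID, length, one numeric value per line). The function name and the keys of the returned dictionary are invented for illustration, and the meaning of the third column is not assumed.

```python
import csv
from collections import Counter


def summarize_contig_info(path):
    """Summarize a header-less CSV with contig ID, length, and one numeric value.

    Hypothetical helper: returns row count, length range, duplicated IDs,
    and the line numbers of rows that do not fit the expected layout.
    """
    ids = []
    lengths = []
    malformed = []
    with open(path, newline="") as handle:
        for line_number, row in enumerate(csv.reader(handle), start=1):
            try:
                contig_id, length_text, value_text = row
                length = int(length_text)
                float(value_text)  # only checks that the value is numeric
            except ValueError:
                # Wrong number of fields or a non-numeric entry.
                malformed.append(line_number)
                continue
            ids.append(contig_id)
            lengths.append(length)
    duplicates = sorted(cid for cid, n in Counter(ids).items() if n > 1)
    return {
        "rows": len(ids),
        "min_length": min(lengths) if lengths else None,
        "max_length": max(lengths) if lengths else None,
        "duplicate_ids": duplicates,
        "malformed_lines": malformed,
    }
```

Running it on `viral_contig_info.csv` and `prokaryotic_contig_info.csv` for both the failing sample and a successful one would show quickly whether duplicated contig IDs or malformed rows set the failing sample apart.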

I'm uncertain if these files shed more light on the issue or if there's a specific aspect you'd like me to investigate further. Any guidance or additional steps you recommend would be greatly appreciated.

Thank you for your continued support.

Best

dyxstat commented 8 months ago

Sorry for the late reply. I have been fully occupied with finding a job recently.

Unfortunately, I cannot find anything wrong in the results you have shared.

Would you be willing to share your data with me? It may be the only solution I have.

I feel so sorry about that.

Best, Yancey