chrisquince / STRONG

Strain Resolution ON Graphs
MIT License

Bayesgraph taking too long #108

Closed marcomeola closed 3 years ago

marcomeola commented 3 years ago

I am running the STRONG pipeline on a deeply sequenced set of 8 samples with limited diversity at the species level. The pipeline ran the bayespaths step for 10 days on the two remaining bins, although it performed well on the other bins, each of which produced a Bin_XX_summary.txt file. After 10 days I stopped it, as I thought it might be stuck in a loop: the log file showed no updates. The command was still running according to htop, but it was using basically no resources. After I reran the pipeline, it restarted from the same step and tried to resolve the two remaining bins, but it has again been running for over a week, and the log file was last updated the day I restarted the pipeline.

My questions are: Is there a way to check whether the pipeline has stopped working correctly or is just taking a long time to do the job? Has anyone else experienced such a long processing time with bayespaths? Why does bayespaths not use all available resources to process the step faster?

I can attach the log files if requested. I look forward to reading your reply.
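
One rough way to tell a stalled run from a slow one is to check how recently the log file was written. The sketch below is not part of STRONG; the function name `log_is_stale`, the log path, and the 24-hour threshold are illustrative assumptions.

```shell
#!/usr/bin/env bash
# Sketch: succeed if a log file has NOT been modified within the last N minutes.
# Arguments: <log file> <maximum age in minutes>

log_is_stale() {
    local log_file="$1"
    local max_age_min="$2"
    # find prints the path only if the file was last modified more than
    # max_age_min minutes ago; an empty result means "recent" (or "missing").
    [ -n "$(find "$log_file" -maxdepth 0 -mmin +"$max_age_min" 2>/dev/null)" ]
}

# Hypothetical usage: treat 24 hours (1440 minutes) without a log update as suspicious.
# if log_is_stale path/to/bayespaths.log 1440; then
#     echo "No log update in the last 24 hours"
# fi
```

A stale log on its own does not prove the job is stuck, since a long computation may simply not write anything for a while, so it is worth combining this with a look at the CPU usage of the process (for example in htop, as above).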

chrisquince commented 3 years ago

I think I know the origin of the bug, and we are going to post a fix soon. Could you share the log files? It might also be useful if we could get the input files for one of those bins, which would be the whole contents of:

subgraphs/bin_merged/Bin_xxx

as a zip then I can try reproducing this?

marcomeola commented 3 years ago

Thank you for your quick answer. I can give you access to my Dropbox folder with the compressed folders for both bins as well as the bayespaths log files. The files are too big (154 MB and 500 MB) to attach here. Please check your email for access to the folder.

marcomeola commented 3 years ago

Any help on how to bypass the bug would be very much appreciated. Did you manage to access the files?

chrisquince commented 3 years ago

Hi Marco, I have had a look at this. The program is running OK, but your input graphs look very strange. They are hairy, indicating that they were not tip-trimmed properly, and there are multiple components for each gene. Have you adapted the pipeline somehow to use an alternate assembler and binning? If not, we will have to look at the graph extraction and cleaning steps.

marcomeola commented 3 years ago

Hi Chris, thank you for looking at this. The only part of the pipeline I've adapted is described here https://github.com/chrisquince/STRONG/issues/107, but that did not prevent other bins from passing through the bayespaths step.

Some samples were resequenced several times at different read lengths, from 75 bp to 150 bp. Could that have an impact even though they were trimmed beforehand? One of the initial parameters to set is the read length, which I set to 75.

Is there any other log file I could look at to spot the problem?

marcomeola commented 3 years ago

I had a deeper look at the graphs and guess you mean the strange hairy extensions between the contigs (see figure). That's actually very strange. The raw reads were trimmed using fastp with this command:

for file in *_R1*.fastq.gz
do
    fastp --in1 ${file} --in2 ${file/_R1/_R2} --out1 ${file/.gz} --out2 ${file/_R1*.fastq.gz/_R2.fastq} --unpaired1 ${file/.fastq.gz}-SE.fastq --unpaired2 ${file/_R1*/_R2}-SE.fastq --correction --qualified_quality_phred 15 --cut_front --cut_tail_mean_quality 20 --adapter_sequence=AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC --adapter_sequence_r2=AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA --detect_adapter_for_pe --thread 10 --html ${file/_R1*}.html --json ${file/_R1*}.json
done

If this step is so crucial for STRONG to work properly, it might be useful to provide best practices for raw read preprocessing before starting the pipeline. Which tool did you use for trimming, and with which parameters?

[Figure: Screenshot 2021-03-11 at 10 23 57, showing the hairy extensions between the contigs]

marcomeola commented 3 years ago

Maybe as a side note: these hairy extensions are about 75 or 150 bp long, which matches the length of the sequencing reads. Since the read length has to be specified in the config.yaml, I wonder whether the pipeline only accepts homogeneous read lengths.
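
To see how heterogeneous the read lengths actually are after trimming, the lengths in each file can be tabulated. This is a minimal sketch, assuming gzipped FASTQ input with exactly four lines per record (no wrapped sequences); the function name `read_length_counts` is hypothetical and not part of STRONG or fastp.

```shell
#!/usr/bin/env bash
# Sketch: print "<read length> <number of reads>" for one gzipped FASTQ file,
# sorted by read length.

read_length_counts() {
    # In a four-line FASTQ record, the sequence is line 2 of each record.
    gzip -dc "$1" \
        | awk 'NR % 4 == 2 { n[length($0)]++ } END { for (len in n) print len, n[len] }' \
        | sort -n
}

# Hypothetical usage on one trimmed file:
# read_length_counts sample1_R1.fastq.gz
```

The smallest length in the first column is the one to compare against any read length or k-mer setting used downstream.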

marcomeola commented 3 years ago

These hairy structures were caused by the k-mer sizes selected. After rerunning the assembly step with 21,33,55 the graphs looked OK. I am therefore closing this thread.
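
The thread does not say which k-mer sizes caused the problem, so the following is only a generic sanity check under that caveat: a read shorter than k contributes no k-mers at all, so each k can be compared against the shortest read length in the data set. The function name `check_kmers` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: warn about k-mer sizes that exceed the shortest read length.
# Arguments: <shortest read length> <k1> [<k2> ...]
# Returns non-zero if any k is larger than the shortest read length.

check_kmers() {
    local min_read_len="$1"
    shift
    local k status=0
    for k in "$@"; do
        if [ "$k" -gt "$min_read_len" ]; then
            echo "k=$k is larger than the shortest read length ($min_read_len)"
            status=1
        fi
    done
    return "$status"
}

# The k-mer sizes that worked in this thread, checked against 75 bp reads:
# check_kmers 75 21 33 55
```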