I think I know the origin of the bug, and we are going to post a fix soon. Could you share the log files? It would also be useful if we could get the input files for one of those bins; that would be the whole contents of:
subgraphs/bin_merged/Bin_xxx
as a zip, so that I can try reproducing this?
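Something along these lines should work for packaging one bin (Bin_xxx is of course a placeholder for the actual bin name):

```bash
# Package the whole input directory of one problematic bin for sharing
zip -r Bin_xxx_subgraphs.zip subgraphs/bin_merged/Bin_xxx
```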
Thank you for your quick answer. I can give you access to my Dropbox folder with the compressed folders of both bins as well as the bayespaths log files. They are too big (154 MB and 500 MB) to attach here. Please check your email for access to the folder.
Any help on how to bypass the bug would be very much appreciated. Did you manage to access the files?
Hi Marco, I have had a look at this. The program is running OK, but your input graphs look very strange. They are hairy, indicating that they were not tip-trimmed properly, and there are multiple components for each gene. Have you adapted the pipeline somehow to use an alternate assembler or binning? If not, we will have to look at the graph extraction and cleaning steps.
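If you want to sanity-check this yourself without opening the graphs in a viewer, and you happen to have Bandage installed, its info mode reports dead ends and connected components per graph; the file name below is just a placeholder for one of the per-gene graphs in the bin directory:

```bash
# Print basic graph statistics (node count, dead ends, connected components, ...)
Bandage info subgraphs/bin_merged/Bin_xxx/COGxxxx.gfa
```

A clean single-copy core gene graph should normally form one component with few dead ends.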
Hi Chris, thank you for looking at this.
The only part of the pipeline I've adapted is described here: https://github.com/chrisquince/STRONG/issues/107
but that did not prevent other bins from passing through the bayespaths step.
Some samples were resequenced several times at different read lengths, ranging from 75 bp to 150 bp. Could that have an impact, even though the reads were trimmed beforehand? One of the initial parameters to be set is the read length, which I set to 75.
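To check whether the trimmed libraries really are of mixed length, I guess I could simply look at the read-length distribution of the trimmed files, e.g. (file name is a placeholder):

```bash
# Read-length distribution of one trimmed library
# (sequence lines are every fourth line of a FASTQ, offset 2)
awk 'NR % 4 == 2 {print length($0)}' sample_R1.fastq | sort -n | uniq -c
```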
Is there any other log file I could look at to spot the problem?
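In the meantime I could at least scan for anything written recently, assuming the logs sit somewhere under the STRONG output directory (path is a placeholder):

```bash
# List log files modified within the last day under the output directory
find /path/to/strong_output -name "*.log" -mtime -1 -exec ls -lh {} \;
```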
I had a deeper look at the graphs and I guess you mean the strange hairy extensions between the contigs (see figure). That is indeed very strange. The raw reads were trimmed using fastp with this command:
```bash
for file in *_R1*.fastq.gz
do
    fastp --in1 ${file} --in2 ${file/_R1/_R2} \
        --out1 ${file/.gz} --out2 ${file/_R1*.fastq.gz/_R2.fastq} \
        --unpaired1 ${file/.fastq.gz}-SE.fastq --unpaired2 ${file/_R1*/_R2}-SE.fastq \
        --correction --qualified_quality_phred 15 --cut_front --cut_tail_mean_quality 20 \
        --adapter_sequence=AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC \
        --adapter_sequence_r2=AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA \
        --detect_adapter_for_pe --thread 10 \
        --html ${file/_R1*}.html --json ${file/_R1*}.json
done
```
If this step is so crucial for STRONG to work properly, it might be useful to provide a best practice for raw-read preprocessing prior to starting the pipeline. Which tool did you use for trimming, and with which parameters?
Maybe as a side note: these hairy extensions are either about 75 or 150 bp long, which matches the lengths of the sequencing reads. Since the read length has to be indicated in config.yaml, I wonder whether the pipeline only accepts homogeneous read lengths.
These hairy structures turned out to be caused by the k-mer sizes selected. After rerunning the assembly step with k-mer sizes 21,33,55 the graphs looked fine, so I am closing this thread.
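For reference, this corresponds to the k-mer sizes you would pass to metaSPAdes directly; in STRONG itself the k-mer list is set in config.yaml (the exact key may depend on the version), but the standalone equivalent would look roughly like this:

```bash
# Standalone metaSPAdes equivalent of the k-mer choice that fixed the graphs
# (file names, threads and memory are placeholders)
spades.py --meta -k 21,33,55 \
    -1 sample_R1.fastq -2 sample_R2.fastq \
    -o assembly_k21_33_55 -t 16 -m 250
```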
I am running the STRONG pipeline on a deeply sequenced set of 8 samples with limited diversity at the species level. The pipeline had been running the bayespaths step for 10 days on the two remaining bins, although it performed well on the other bins, producing a Bin_XX_summary.txt file for each. After 10 days I stopped it, as I thought it might have been stuck in a loop because the log file showed no updates. The command was still running according to htop, but essentially no resources were being used for the analysis. After rerunning the pipeline it restarted from the same step, trying to resolve the two remaining bins, but it has now been running for over a week and the log file was last updated on the day I restarted the pipeline.
My questions are: Is there a way I can check whether the pipeline has stopped working correctly, or whether it is simply taking this long to do the job? Has anyone else experienced such a long processing time for bayespaths? Why does bayespaths not use all available resources to process this step faster?
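For example, would it be meaningful to inspect the running process directly with something like py-spy (assuming bayespaths is an ordinary Python process; the PID below is a placeholder)?

```bash
# Show where the running bayespaths process is currently executing
# py-spy is installed separately, e.g. with: pip install py-spy
PID=12345   # replace with the bayespaths PID from htop/ps
py-spy dump --pid "$PID"
```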
I can attach the log files if requested. I look forward to reading your reply.