Error in rule filter_fastq

cliffbueno commented 6 days ago

Hello,

There seems to be an error at the fastq-fasta conversion, or fastq filtering step. Can you provide advice based on the error message below? I have 180 fastq samples and it seems to proceed through most of them but then quits. This happened to a different sample in a previous try. When I removed that one, then the error occurred on this one. Thanks!

Error in rule filter_fastq: jobid: 408 input: tmp/samples/Q#_683_concat.fastq output: tmp/samples/Q#_683_filtered.fastq, tmp/read_count/Q#_683/Q#_683_total_reads_post_filtering.tsv log: logs/filter_fastq/Q#_683.log (check log file(s) for error message) conda-env: /data/cliffb/ONT_Compare/ONTAmpSeq/ONT-AmpSeq-main/.snakemake/conda/c077327251f16e2049fb6bff0c5b389f shell:

    chopper         --minlength 250         --maxlength 300         --headcrop 22         --tailcrop 22         -q 20         -t 2         < tmp/samples/Q#_683_concat.fastq         > tmp/samples/Q#_683_filtered.fastq

    # Count total reads
    num_reads=$(($(wc -l < "tmp/samples/Q#_683_filtered.fastq") / 4))
    # Put into a temporary file
    echo -e "Sample Reads_Post_Filtering

Q#_683 $num_reads" > tmp/read_count/Q#_683/Q#_683_total_reads_post_filtering.tsv

    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

PSSchacksen commented 6 days ago

Hi Cliff

Could you please provide the respective log files for the error and the run? (log: logs/filter_fastq/Q#683.log) Also, how are your files structured in the input dir.

Best regards P

cliffbueno commented 5 days ago

In logs/filter_fastq there are 136 of 180 file names there, but all of the log files are empty. In ONT-AmpSeq-main/data/samples I have 180 .fastq.gz files. In ONT-AmpSeq-main/tmp/read_counts there are 146 of 180 directories. In ONT-AmpSeq-main/tmp/samples there are 145 files. 10 are _concat.fastq including the one that gave the error. 135 are _filtered.fasta.

cliffbueno commented 5 days ago

Oh and this was the original script. Thanks for the help. cd ~/ONT_Compare/ONT_AmpSeq/ONT-AmpSeq-main

mamba activate snakemake

screen snakemake --cores 24 --use-conda --config include_blast_output=False db_path_sintax=database/rdp_16s_v18.fa length_lower_limit=250 length_upper_limit=300 quality_cut_off=20 input_dir=data/samples metadata=data/metadata/Metadata180.txt

PSSchacksen commented 4 days ago

Again it is very hard to assist in what goes wrong for your run without the log files. So please do provide one from a respective run, it may be named name.out instead of .log.

What might be happening is your thresholds are too narrow, resulting in empty files. Have you run the stats script on your samples and checked whether or not the samples giving you some trouble actually contain reads within your given filtering criteria? Using the stats script on all your samples provides a opportunity to check the content of your samples. If some of your samples does not contain reads within the given filtering criteria, it will result in a empty file potentially being deleted by snakemake and therefore causing an error.

Do let me know how it looks like after having run stats on the samples giving you some trouble and providing the log (.out) file for a troublesome file.

If it is because a file does not contain sequences within your given filtering criteria, it will cause an error as snakemake cannot run with empty files. Meaning that you should remove the samples without anything in them from your starting folder and rerun the script.

cliffbueno commented 4 days ago

Hi Patrick,

I wish I could provide a log file, but they're all empty! If I go to ONT-AmpSeq-main/logs there is capture_config.log which is empty, and three directories, convert_to_fasta, concatenate_fastq, filter_fastq. If I go into each of those directories, there are 137, 157, and 142 .log files, respectively, but they are all empty. In ONT-AmpSeq-main I also checked find ./ -type f -name "*.out" and it didn't find anything.

It's good to know that having samples with no filtered reads will cause an error. But that doesn't seem to be the case here. I ran the stats script. So when I run the main pipeline on all samples the "Error in rule filter_fastq:" occurs on sample Q#_3794. So I looked at the read length vs. quality plot for that sample. Also attaching here. There are reads that are between 250-300 bp long and above Q20 which are the two thresholds I set. LengthvsQualityScatterPlot_dot

PSSchacksen commented 4 days ago

Is there one in "path/logs/filter_fastq/" called "filter_fastq-sample=Q#_3794-numbers.out" or something similar?

There should also be a *_dot.html, could you try and open that for a failed file, then zoom into the 250-300 bp Read lengths range and check if there are reads? Just to ensure the length match when zoomed in.

Also could you do a "ls tmp/samples/" and check if the names look correct compared to your input files?

If these look good another thing you can try and do is: conda activate snakemake snakemake --unlock

this sould force a rerun the failed samples if they look good

cliffbueno commented 4 days ago

There are no .out files anywhere in ONT-AmpSeq-main/logs. I checked the .html file and can confirm that there are indeed reads 250-300 bp long and > Q20 for that sample.

There are 157 files in tmp/samples. 17 have sampleID_concat.fastq and 140 have sampleID_filtered.fasta. The sampleIDs are correct for all.

Okay, how do I use the snakemake --unlock command. Do I add --unlock as an argument in the main pipeline run, or once the run fails then do I make a separate snakemake --unlock call?

PSSchacksen commented 4 days ago

Could you provide a list in txt format of all input filenames and in another file a list of all the filenames in the path/tmp/samples/ ?

Simply in your console write the following two lines: conda activate snakemake snakemake --unlock

Then delete the tmp and log dirs and rerun your samples again.

cliffbueno commented 3 days ago

Here are the text files for the run prior to snakemake --unlock. Then after snakemake unlock, I got the same error in rule filter_fastq on sample Q#_3794. input_sample_names.txt tmp_sample_names.txt

PSSchacksen commented 2 days ago

After going through the code for log files we have figured out why they are empty when using the desktop version that you are using and are looking into it. I have also gone through the input and output names and cant seem to find a consistent reason for the unavailable files.

Could you provide a screenshot of the stats for C1R2 and C2R1, while zooming into the x-axis for the 250-300bp range using the html file?

If there is no issue within these could you perhaps try to use the local install of the ONT-AmpSeq, thereby the log files should provide more information and allow for more settings to be adjusted.

cliffbueno commented 2 days ago

Interesting. Thanks for checking that.

Okay here are the stats for those. C1R2 C2R1

I don't quite follow the Desktop vs. Local install difference. I was basically following the Example Usage section of the Read Me.

PSSchacksen commented 1 day ago

Could you by any chance rename all your files to exclude "_" in the names?

You are using the "screen snakemake" ONT-AmpSeq approach, instead of downloading the repository and running the "snakemake --profile path/profiles/config.yml" approach. Using this local installation of the workflow should provide the log files and more info (currently missing from the desktop version). See https://github.com/MathiasEskildsen/ONT-AmpSeq?tab=readme-ov-file#usage-of-workflow-local-repository-installation-aau-biocloud-hpc-users.

cliffbueno commented 10 hours ago

Ok I renamed everything without "_" but the same error occurred.

Then after editing profiles/biocloud/config.yaml and config/config.yaml for my computer and parameters, I tried snakemake --profile /profiles/biocloud. There were multiple errors including: Error in rule merge_read_count Error in rule capture_config Error in rule relabel_merge Log files were empty except for logs/capture_config/capture_config--16.out and capture_config--19.out

PSSchacksen commented 15 minutes ago

At this point I'm sadly running out of ways i can try to figure out what happens without the log files.

Is it possible to try and use this stats script on your files. It merges the .gz files into a single file. Then could you possible try and use the provided path/fastq/*.fastq files as input? (i renamed the .sh extension to a .txt to be able to attach it) nanoplot.txt

If that does not work, then could you do a subset of your samples e.g. 10-20 samples of which you know the pipeline can handle and try to run that?

I hope this works then.

If neither works, then potentially there need to check the integrity of some of the merged fastq files using software like: https://github.com/stevekm/fastq-checker

MathiasEskildsen / ONT-AmpSeq

Error in rule filter_fastq #8