Nucleomics-VIB / InSilico_PCR

extract long read subsequences from a pair of primers
5 stars 1 forks source link

Issue running on new genomic data #2

Closed MaestSi closed 4 years ago

MaestSi commented 4 years ago

Dear Stephane, I am trying out your tool on a set of genomic MinION reads; in particular, I would like to extract the sequence between a pair of primers. I created the InSilico_PCR conda environment, I installed the required dependencies and then I opened the InSilicoPCR.sh script to comment the part related to the downloading of the Zymo data, to set the forward primer and the reverse primer sequences and to modify the infile variable according to my .fq.gz file name. The first error I get is: `gzip: RawData/.fq.gz: No such file or directory` What do you think I am missing? Thanks, Simone

splaisan commented 4 years ago

you need to define $infile on the top of the script as it is used to define files and folder names later in the case of the repo script infile is "Zymo-PromethION-EVEN-BB-SN.fq.gz"

you can provide your own fastq.gz as $1 with that script and it should work, otherwise you will need to define 'infile' and 'name' manually in the code as they get used in different places

does this help?

MaestSi commented 4 years ago

I already set:

#infile=${1:-"Zymo-PromethION-EVEN-BB-SN.fq.gz"}
infile=${1:-"<sample name>.fq.gz"}

and I also tried providing it as the first argument, but it is not working. Any other suggestions?

splaisan commented 4 years ago

Just set infile to you file name that was put in RawData_"filename" folder or change the logic later on in the script. I will create a second script tomorrow to ease this

MaestSi commented 4 years ago

I think I solved this issue by adding: cp $infile ${data} just after:

# Raw data
data="RawData_${name}"
mkdir -p ${data}

Thanks for the prompt reply! Now it started running, but then I encountered another issue: The error is:

The ASCII quality encoding offset (64) is not set correctly, or the reads are corrupt; quality value below -5.
Please re-run with the flag 'qin=33', 'ignorebadquality', or '-da'.
Problematic read number 4:
<read sequence>
Offset=64
java.lang.Exception: Aborting.
        at shared.KillSwitch.kill(KillSwitch.java:108)
        at stream.FASTQ.quadToRead_slow(FASTQ.java:754)
        at stream.FASTQ.toReadList(FASTQ.java:646)
        at stream.FastqReadInputStream.fillBuffer(FastqReadInputStream.java:107)
        at stream.FastqReadInputStream.hasMore(FastqReadInputStream.java:73)
        at stream.ConcurrentGenericReadInputStream$ReadThread.readLists(ConcurrentGenericReadInputStream.java:667)
        at stream.ConcurrentGenericReadInputStream$ReadThread.run(ConcurrentGenericReadInputStream.java:656)

Simone

splaisan commented 4 years ago

Could you check the version of msa.sh you are using. I had the same issue while developing this and found that my installed bbmap tools was very old. I will check tomorrow which one I now run and post here . There are bbmap versions out there which are not latest one from https://sourceforge.net/projects/bbmap/

splaisan commented 4 years ago

Buongiorno Simone :-)

I checked my BBmap and it was installed from my conda environment.yaml file as: bbmap 38.67 h516909a_0 bioconda

can you try msa.sh --version and compare to this:

 msa.sh --version
java -ea -Xmx151530m -cp /opt/biotools/miniconda3/envs/InSilico_PCR/opt/bbmap-38.67-0/current/ jgi.FindPrimers --version
BBMap version 38.67
For help, please run the shellscript with no parameters, or look in /docs/.

did you install the dependencies as explained in the top of the shellscript or manually? I will replace the shell script by a more intelligent one today.

Cheers

MaestSi commented 4 years ago

Hi, I am running the latest BBmap version I found in anaconda channel: java -ea -Xmx51773m -cp /home/simone/miniconda3/envs/InSilico_PCR/opt/bbmap-38.69-0/current/ jgi.FindPrimers --version BBMap version 38.69 For help, please run the shellscript with no parameters, or look in /docs/. I first installed the dependencies manually after creating a InSilico_PCR conda environment and also tried with conda env create -f environment.yaml to install the required dependencies.

splaisan commented 4 years ago

few ideas.

Are your reads in phred base 64 or 33? my code was done for 33 as you see with qin=33.

What does read #4 look like when you parse your input?

Can you try to run the msa.sh command (left primer) in terminal after declaring all necessary variables on a single fastq from your splitted fastq and see what happens?

msa.sh -Xmx${mem} qin=${qual} in=${your_fastq.gz} out=${tmpout}/forward_hits.sam literal="${forwardp}" rcomp=t cutoff=${cut}

The error you see was also present in my initial runs and disappeared after changing BBmap version. You could also add to the msa.sh command a parameter to ignore or fix input errors ('qin=33', 'ignorebadquality', or '-da')

You may also mail the BBmap author and ask him for help (bbushnell -at- lbl.gov)

Hope this helps

MaestSi commented 4 years ago

The quality values of Read #4 consist of many '@' (that should represent a Qscore of 31) with some other symbols. Since these are ONT basecalled with Guppy, I think the quality offset should be 33 as for your data. I also tried changing qin=33 to ignorebadquality=t but it crashed again. I ran your command on a single fastq and it didn't give any errors. The output was:

Time:                           102.423 seconds.
Reads Processed:    500000      4.88k reads/sec
Average Identity:   81.60%
Min Identity:       32
splaisan commented 4 years ago

This suggests that the programs do it when they are happy with the input; I suspect that some of your fastq records are either empty or badly formatted. Maybe the record just before of just after #4 has an issue or your fastq is not in blocks of 4 lines somehow!

Did it work with the same fastq containing read 4 or with another one?

Could you run some validator on your whole fastq (maybe https://github.com/alastair-droop/fqtools) not tried but has a validate function, otherwise bioawk and testing for the presence of @ and + every second line of data

alternatively you could run the forward search on each fastq you have split until you get an error to identify the faulty bins

I re-ran my script with demo data and the latest bbmap build and it still works.

BTW:I just uploaded a new version of my script which may not fix your problem but could be better for your next run.

MaestSi commented 4 years ago

I retried running on the full dataset with the newer script, but as you expected it gave the same error. The reads giving problems are reads that have a lot of '@' in the fastq quality, I don't know if this might confuse the software that is expected to find '@' mainly at the beginning of the headers. I did not perform further checks, I'll let you know! Have a nice weekend, Simone

splaisan commented 4 years ago

Hi Simone, Can I close this now or are you still failing to use the code? Best S

MaestSi commented 4 years ago

Hi Stephane, I am still not able to run your code. I also wrote BBMap developer for asking help, but he has not answered back yet. However, I noticed that skipping the parallelisation you implemented, and simply running:

msa.sh in=reads.fastq out=in_silico_pcr_one.sam literal=$pcr_silico_primer_one qin=33 cutoff=0.85
msa.sh in=reads.fastq out=in_silico_pcr_two.sam literal=$pcr_silico_primer_two qin=33 cutoff=0.85
cutprimers.sh in=reads.fastq out=reads_trimmed.fastq sam1=in_silico_pcr_one.sam sam2=in_silico_pcr_two.sam qin=33 fake=f include=t fixjunk

did the job. Do you spot anything different in my 3 commands with respect to what you do in your script, a part from parallelisation? Simone

splaisan commented 4 years ago

Maybe you do not have parallel installed! What shows up when you type 'parallel --help'

Otherwise, you inputs or parameters may not reflect your server configuration. Did you check all parameters in the editable part of the script (version 1.1) and give only one argument in terminal being your input fastq.gz to the script?

!! the script is given for 80 threads and you may not have and 8g per job which is also a lot!

You may try to run the script with './InSilico_PCR.sh -d' (demo mode) after adapting all variables to match your machine.

Let me know if you get it to work

MaestSi commented 4 years ago

Hi Stephane, yes, I have parallel installed. Yes, I checked all parameters and gave only one input argument, being fastq.gz reads. I noticed that in your script, when calling cutprimers.sh the qin parameter is missing. Therefore, I tried adding qin=${qual} and now it works on my data. Regarding running time, I noticed that it takes more time to run with your parallelization script than running the 3 commands I wrote in the previous post; probably it might be due to the fact that my fastq.gz file is not that big. Out of curiosity, did you do any running time comparison with and without parallel with demo data?

splaisan commented 4 years ago

Thanks a lot Simone. I probably lost the qual parameter when i edited the script. I will correct this tomorrow. Yes i ran without parallel first and on the huge demo dataset it made a big change but you are perfectly right that on a small input the direct commands will be better

MaestSi commented 4 years ago

Perfect! I am closing the issue. And thanks for the running time information too! Simone