GiantSpaceRobot / FindFungi

A pipeline for the identification of fungi in public metagenomics datasets

Issue with running FindFungi.sh #12

Closed mradz19 closed 4 years ago

mradz19 commented 4 years ago

I am having trouble running the FindFungi.sh script. I'm not sure where in the script I am supposed to put the paths to the kraken and skewer executables, and I think that may be the issue. I have added the paths to the directories containing the databases, scripts, etc.

However, when I run bash FindFungi.sh reads.fastq EXAMPLE I just get this response:

Started at Thursday 5 March 09:57:25 AEDT 2020
Finished at Thursday 5 March 09:57:25 AEDT 2020

Not sure what I need to change. Do I need to add anything to lines 17 and 18 in the script (the x and z variables)?

GiantSpaceRobot commented 4 years ago

Hi there,

If you changed the paths in lines 22-26 then that should be enough. Can you confirm that the directory named EXAMPLE has been created? If it has, please delete this directory and run the pipeline again.
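For reference, a minimal sketch of the clean re-run, assuming the output directory is created under the current working directory:

# Remove the partially created output directory, then re-run the pipeline
rm -rf EXAMPLE
bash FindFungi.sh reads.fastq EXAMPLE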

Please send any more output generated by the pipeline. Thanks.

mradz19 commented 4 years ago

There was no directory named EXAMPLE created that I can see. I realised the problem was that I didn't specify $z on the PreDir line.

However, I realised I cannot run the script now, as I do not have BSUB on this server and I am unable to install it. If I remove BSUB from the script, will the workflow still be able to run?

GiantSpaceRobot commented 4 years ago

Hi there,

Removing the bsub commands might work depending on a few things.

Do you have SLURM or any other workload manager? If so, then it is possible to get the pipeline up and running. If not, then it will depend on the system you are running: how much memory/RAM does the system have, and what OS is it running?
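If it's useful, a few standard Linux commands for checking these:

free -h          # total and available RAM
nproc            # number of CPU cores
lsb_release -a   # OS name and version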

mradz19 commented 4 years ago

Typical RAM availability is about 250 GB, there are 44 CPUs, and the OS is Ubuntu 16.04 LTS. I don't believe SLURM is available on this server.

I did try running it anyway with BSUB removed, and it seemed to be working well up until line 127, where I got an error saying:

tail: cannot open 'findfungi/output/P100/FindFungi/Processing/ReadNames_bsub.*.fsa' for reading: No such file or directory

The file is there, but I did note that it is 0 KB in size.

GiantSpaceRobot commented 4 years ago

Hi there,

Some of the bsub commands are complicated. Please change line 127 to:

grep -w -A 1 -Ff $Dir/Processing/ReadNames.$Taxid.txt $Dir/Processing/Reads-From-Kraken-Output.$z.Reformatted.fsa --no-group-separator > $Dir/Processing/ReadNames_bsub.$Taxid.fsa &

and replace line 135 with:

$d > $Dir/Processing/ReadNames.$Taxid.fsa

Let me know how you get on with this.
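For reference, an annotated copy of that suggested grep line (the comments are explanatory additions, not part of the pipeline):

# -F  treat the patterns as fixed strings; -f  read them (the read names) from the file that follows
# -w  match whole words only; -A 1  also print the line after each match (the sequence line)
# --no-group-separator  suppress the '--' lines GNU grep normally prints between match groups
grep -w -A 1 -Ff $Dir/Processing/ReadNames.$Taxid.txt \
    $Dir/Processing/Reads-From-Kraken-Output.$z.Reformatted.fsa \
    --no-group-separator > $Dir/Processing/ReadNames_bsub.$Taxid.fsa &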

mradz19 commented 4 years ago

I've been trying to run FindFungi with the changes you suggested above. However, when running Kraken, most of the classify stages are failing due to a failure to allocate memory. I assume that's what bsub was for? Of the 32 databases, about 23 fail to run.

GiantSpaceRobot commented 4 years ago

The bsub commands are for running the code on an LSF (Load Sharing Facility) platform, a workload manager like SLURM. Now that you have removed them, all instances of Kraken are being executed at once on a single machine. This is very RAM intensive, but can be fixed by removing the ampersand (&) from the end of lines 65 (Kraken) and 136 (BLAST). These programmes will then run in series rather than in parallel, which will result in quite long run times. Let me know how you get on, thanks.
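To illustrate the change, a minimal sketch of the pattern being edited; the Kraken invocation and variable names here are placeholders rather than the pipeline's exact lines:

# Parallel: the trailing '&' backgrounds each run, so all 32 databases
# are loaded into RAM at the same time
for db in $Kraken_DB_Dir/*; do
        kraken --db $db --fasta-input $ReadFile --output $OutDir/Kraken.$(basename $db).tsv &
done
wait    # block until every background job has finished

# Serial: with the '&' removed, each run completes (and frees its RAM)
# before the next one starts
for db in $Kraken_DB_Dir/*; do
        kraken --db $db --fasta-input $ReadFile --output $OutDir/Kraken.$(basename $db).tsv
done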

mradz19 commented 4 years ago

I made the relevant changes you suggested, but I am still getting the failure-to-allocate-memory warning, although now I am getting it after Kraken has run against all of the databases instead of at the beginning. Are there any settings I can tweak to help with memory allocation problems?

GiantSpaceRobot commented 4 years ago

Do you know which line the pipeline is failing on? Do all of the Kraken runs complete, i.e. are there result files for all 32?

mradz19 commented 4 years ago

It is stopping at line 85:

for d in $Dir/Processing/SplitFiles_Kraken/*; do #Sort individual Kraken output files
        File=$(basename $d)
        bsub -K -q C sort_parallel --parallel 16 -o $Dir/Processing/SplitFiles_Kraken/sorted_$File -k2,2 $d &
        sort -k2,2 $d > $Dir/Processing/SplitFiles_Kraken/sorted_$File &

The last file generated is AllClassified_sample; all of the sorted_sample.1-32 files are 0 bytes, with nothing written to them.

mradz19 commented 4 years ago

Just another update. I removed the ampersand and the program progressed further, but I think it is now stopping at line 93.

I see a sorted.sample.All-Kraken-Results.tsv file is generated, however it is 0 bytes in size, while the individual sorted files are about 3-4 GB each. The rest of the pipeline seems to progress, but none of the outputs from line 93 onwards have anything written to them.

GiantSpaceRobot commented 4 years ago

Hi there,

If the pipeline is crashing there, it may be a memory issue. Can you try making a subset of your FASTQ file (10,000 reads or so) and running the pipeline with this?
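A minimal sketch of making such a subset, assuming standard 4-line FASTQ records (header, sequence, '+', quality), so 10,000 reads is 40,000 lines; the output filename is just an example:

# Take the first 10,000 reads and run the pipeline on them
head -n 40000 reads.fastq > reads.subset.fastq
bash FindFungi.sh reads.subset.fastq EXAMPLE_SUBSET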

mradz19 commented 4 years ago

Hi,

I tried with a sample containing 4 million reads instead of 50 million like the other one. The pipeline progressed through to line 157 but then stopped with the following error:

File "/mnt/nfs/home/private/my_local_scratch/findfungi/FindFungi-v0.23.3/LowestCommonAncestor_V4.py", line 9 ScriptPath=/mnt/nfs/home/private/my_local_scratch/findfungi/FindFungi-v0.23.3 #Location of downloaded python and shell scripts ^ SyntaxError: invalid syntax Done File "/mnt/nfs/home/30041036/my_local_scratch/findfungi/FindFungi-v0.23.3/LowestCommonAncestor_V4.py", line 9 ScriptPath=/mnt/nfs/home/private/my_local_scratch/findfungi/FindFungi-v0.23.3 #Location of downloaded python and shell scripts ^ SyntaxError: invalid syntax Done

I'm not sure what to make of this error. The script path I designated in LowestCommonAncestor.sh and .py is the same as what is in FindFungi.sh

GiantSpaceRobot commented 4 years ago

Hi there,

I'm glad to hear the pipeline got further than before.

As for the new error, I'm not sure what is happening here. I haven't seen errors in this script before. However, the path is /mnt/nfs/home/private in the first error and /mnt/nfs/home/30041036 in the second.

Also, I'm not sure if I picked this up wrong, but the script path should not be changed in LowestCommonAncestor.py.

mradz19 commented 4 years ago

Sorry, I've fixed the error above; both paths should have said private.

I realised that when I had changed the script path in the .sh script I had accidentally overwritten the .py script. Fixing that helped with the previous error.

I think it has pretty much worked all the way through! Just in terms of the results, I've noticed that the results.csv file is empty, and the results file with false positives is quite large. However, none of the predicted fungal reads have any BLAST hits. Does that mean they were all predicted to be bacterial instead? These are the run statistics:

Number of reads in the raw input: 4285226
Number of reads after trimming: 4285225
Number of reads removed by Kraken: 3974369
Number of reads remaining after Kraken: 310856
Number of bacterial reads removed by BLAST: 310856
Number of reads predicted as fungal: 0

Am I interpreting that correctly?

GiantSpaceRobot commented 4 years ago

I'm glad to hear the pipeline is running as normal.

It does appear that FindFungi has removed all of the reads in the BLAST step. I have seen this before in a number of datasets. I would suggest running the pipeline with the example dataset (ERR675624 - https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=ERR675624). If the results match the example results from the README, then your dataset doesn't contain any fungi detectable by FindFungi.
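One way to fetch that dataset, assuming the NCBI SRA Toolkit is installed (the run name ERR675624_TEST is just an example):

prefetch ERR675624
fasterq-dump ERR675624    # writes ERR675624.fastq (or _1/_2 files for paired-end data)
bash FindFungi.sh ERR675624.fastq ERR675624_TEST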

mradz19 commented 4 years ago

Hi, sorry,

I ran FindFungi using the ERR675624 dataset and I have the same problem. The output Final_Results_ERR_AllResults-Ica.sorted shows a list of fungal IDs, yet the Pearson skewness score is N/A for each ID, and all of the lines report (no BLAST hits). The summary text file again shows the number of reads predicted as fungal to be 0. So I think I am having a lot more issues here.

Looking at the All.Skewness.File, it shows that no score was calculated and there were no BLAST hits.

The only thing I know of is that I don't have enscript installed (I'm unable to do that on this server), but am I correct in assuming that should have no effect on the BLAST and skewness steps?

GiantSpaceRobot commented 4 years ago

Not having enscript should be fine; the text reports should still generate.

Are there any files in the folder .../Results/BLAST_Processing/?
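If it helps, empty outputs can be spotted quickly; a sketch assuming GNU findutils, with $Dir standing for the pipeline's output directory as above:

ls -lh $Dir/Results/BLAST_Processing/                  # list files with sizes
find $Dir/Results/BLAST_Processing/ -type f -size 0    # list only zero-byte files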

mradz19 commented 4 years ago

Sorry, I tried re-running with the test sequences. The content of /Results/BLAST_Processing/ is 20 BLAST.* files, all of which are 0 bytes in size.

mradz19 commented 4 years ago

It must be an issue with BLAST, but I can't work it out. All the results CSV files list fungal taxa, but none of them have BLAST hits or Pearson skewness scores.

Just confirming that these are the lines related to BLAST and calculating the skewness:

### BLAST against the genome of the predicted species
for d in $Dir/Processing/ReadNames_bsub.*.fsa; do
        File=$(basename $d)
        Taxid=$(echo $File | awk -F '.' '{print $2}')
        #tail -n +31 $d | head -n -6 > $Dir/Processing/ReadNames.$Taxid.fsa
        $d > $Dir/Processing/ReadNames.$Taxid.fsa
        blastn -task megablast -query $Dir/Processing/ReadNames.$Taxid.fsa -db $BLAST_DB_Dir/Taxid-$Taxid -out $Dir/Results/BLAST_Processing/BLAST.$Taxid -evalue 1E-20 -num_threads 30 -outfmt 6
done
wait

### Get the best hit for every read and determine how many chromosomes these reads hit, and how many times
### Calculate the Pearson coefficient of skewness for each species, and gather together all species skewness scores
for d in $Dir/Results/BLAST_Processing/BLAST*; do
        File=$(basename $d)
        Taxid="${File#BLAST.}"
        awk '! a[$1]++' $d > $Dir/Results/BLAST_Processing/Top-Hits.$File
        awk '{print $2}' $Dir/Results/BLAST_Processing/Top-Hits.$File | sort | uniq -c > $Dir/Results/BLAST_Processing/Hit-Distribution.$File
        python2.7 $ScriptPath/Skewness-Calculator_V4.py $Dir/Results/BLAST_Processing/Hit-Distribution.$File $Dir/Results/BLAST_Processing/Skewness.$File &
done
wait
cat $Dir/Results/BLAST_Processing/Skewness* > $Dir/Results/BLAST_Processing/All-Skewness-Scores &
GiantSpaceRobot commented 4 years ago

Hi there,

Those are indeed the lines of code relating to the BLAST and skewness score calculations.

Are there files in the directory that is looped over to carry out the blast search ($Dir/Processing/ReadNames_bsub.*.fsa)? If so, you could try manually blasting one of these against the corresponding blast database. If that is successful, have a look to see that the files named Processing/ReadNames.$Taxid.fsa are actually being generated.

mradz19 commented 4 years ago

There are a lot of files called ReadNames_bsub.*.fsa, and there are also files named ReadNames.$Taxid.fsa; however, nothing has been written to these files (0 bytes). I have tried running the BLAST step manually and I am getting this error:

File is not accessible: 'FindFungi/Processing/Readnames..fsa'
GiantSpaceRobot commented 4 years ago

Can you copy and paste your blast command here please?

mradz19 commented 4 years ago

I tried again, this was the command and error (I picked a random query file):

blastn -task megablast -query ReadNames_bsub.1117330.fsa -db ../../BLAST_DB/FungalGenomeDatabases_EqualContigs/Taxid-$Taxid -out BLAST.Taxid -evalue 1E-20 -num_threads 30 -outfmt 6
BLAST Database error: No alias or index file found for nucleotide database [../../BLAST_DB/FungalGenomeDatabases_EqualContigs/Taxid-] in search path [/findfungi/output/test::]
GiantSpaceRobot commented 4 years ago

Hi there,

$Taxid is a variable name that changes with each iteration of the for loop. Try changing this:

-db ../../BLAST_DB/FungalGenomeDatabases_EqualContigs/Taxid-$Taxid

to this:

-db ../../BLAST_DB/FungalGenomeDatabases_EqualContigs/Taxid-1117330

and try again.
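For clarity, this is how $Taxid gets its value inside the loop quoted earlier, using your query file as the example:

File=$(basename ReadNames_bsub.1117330.fsa)      # -> ReadNames_bsub.1117330.fsa
Taxid=$(echo $File | awk -F '.' '{print $2}')    # split on '.', keep field 2 -> 1117330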

mradz19 commented 4 years ago

That seems to have worked, but I just got this error:

Warning: Sequence contains no data

I tried a few other, larger ReadNames_bsub.*.fsa files, but they output the same error. I checked the fsa files and there are sequence lists present.

GiantSpaceRobot commented 4 years ago

Sequence lists or sequences? Can you paste the first few lines of one of these files here please?

mradz19 commented 4 years ago

Sure these are the first few lines:

>ERR675624.577_FC61KB9AAXX:4:1:1377:7330#GATCAG/1       CTTGTGATTTTTGCACATTAATTTTGCATCCTGAGACTTTGCTGAAGTTGCTTATCAGCTTAAGGAG
>ERR675624.955_FC61KB9AAXX:4:1:1472:4015#GATCAG/1       CTCTGCCTCCTAAAGTACTGGGATTACAGGCGGTAATAGTTAATTTTTTTTTTTTTTTTTTTTTTT
>ERR675624.1242_FC61KB9AAXX:4:1:1548:5587#GATCAG/1      TCCACAATGGTTGAACTAGTTTACAGTCCCACCAACAGTGTAAAAGTGTTCCTATTTCTCCACATCCTCT
>ERR675624.1580_FC61KB9AAXX:4:1:1628:19355#GATCAG/1     GAACCAAAAAAAGGCCCACATAGCCAAAGCAAGACTAAGCAAAAAAAAAAAAAAAAAAAAAA
>ERR675624.3311_FC61KB9AAXX:4:1:2038:3317#GATCAG/1      CTTGAATTAATTTTTGTATAAGATGTAAGGAAGGGATCCAGTTTCAGCTTTCTATATATGGCTAGCC
GiantSpaceRobot commented 4 years ago

Ah I see. That FASTA file is not formatted correctly; I'm not sure why that is. If you run the following it should correct it:

sed 's/ \+/\n/g' YOUR-FILE.fa > YOUR-NEW-FILE.fa

Rerun your BLAST command and it should work. If it does, then that explains why the for loop running the BLAST step wasn't working. You can replace the line

$d > $Dir/Processing/ReadNames.$Taxid.fsa

with

sed 's/ \+/\n/g' $d > $Dir/Processing/ReadNames.$Taxid.fsa

mradz19 commented 4 years ago

I ran that on the test fsa file and the output still matches the original file:

>ERR675624.577_FC61KB9AAXX:4:1:1377:7330#GATCAG/1       CTTGTGATTTTTGCACATTAATTTTGCATCCTGAGACTTTGCTGAAGTTGCTTATCAGCTTAAGGAG
>ERR675624.955_FC61KB9AAXX:4:1:1472:4015#GATCAG/1       CTCTGCCTCCTAAAGTACTGGGATTACAGGCGGTAATAGTTAATTTTTTTTTTTTTTTTTTTTTTT
>ERR675624.1242_FC61KB9AAXX:4:1:1548:5587#GATCAG/1      TCCACAATGGTTGAACTAGTTTACAGTCCCACCAACAGTGTAAAAGTGTTCCTATTTCTCCACATCCTCT
>ERR675624.1580_FC61KB9AAXX:4:1:1628:19355#GATCAG/1     GAACCAAAAAAAGGCCCACATAGCCAAAGCAAGACTAAGCAAAAAAAAAAAAAAAAAAAAAA
>ERR675624.3311_FC61KB9AAXX:4:1:2038:3317#GATCAG/1      CTTGAATTAATTTTTGTATAAGATGTAAGGAAGGGATCCAGTTTCAGCTTTCTATATATGGCTAGCC
mradz19 commented 4 years ago

The command I used:

sed 's/ \+/\n/g' ReadNames_bsub.1501327.fsa > N_ReadNames_bsub.1501327.fsa
GiantSpaceRobot commented 4 years ago

Hi there,

Sorry, I assumed those were spaces but they may be tabs. If so, this should work:

sed 's/\t/\n/g' ReadNames_bsub.1501327.fsa > N_ReadNames_bsub.1501327.fsa
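If it's ever unclear whether the separators are tabs or spaces, GNU cat can make them visible:

# 'cat -A' shows tabs as ^I and marks line ends with '$'
head -n 2 ReadNames_bsub.1501327.fsa | cat -A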

mradz19 commented 4 years ago

That did the trick! I will change the line in FindFungi.sh and re-run the pipeline; I'll report back on how it goes.

mradz19 commented 4 years ago

Pipeline finished with the expected results! Thanks for your lengthy help with this; I'll be running my samples now.