A pipeline for the identification of fungi in public metagenomics datasets. The FindFungi pipeline uses the metagenomics read-classifier Kraken with 32 custom fungal databases to generate 32 taxon predictions for a single read. These 32 predictions are combined to generate a consensus prediction. All reads are then BLASTed against their predicted genomes to generate read distribution skewness scores to select for the most likely true positives.
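The consensus step can be pictured as a vote over the per-database predictions for a single read. The sketch below is not FindFungi's actual implementation. It assumes a hypothetical input of one predicted taxid per line (one line per database) and picks the most frequent taxid. The function name consensus_taxid and the tie-breaking rule are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of FindFungi): read one predicted taxid
# per line on stdin and print the most frequent one.
# Ties are broken by the numerically smallest taxid, an arbitrary choice.
consensus_taxid() {
    sort -n | uniq -c | sort -k1,1nr -k2,2n | head -n 1 | awk '{print $2}'
}

# Example: five made-up predictions for a single read.
printf '%s\n' 76775 76775 294747 76775 1759314 | consensus_taxid
```

With the made-up votes above, the taxid 76775 wins with three of five predictions.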
FindFungi-0.23 was built on an IBM Platform LSF (Load Sharing Facility) cluster with 32 worker nodes. If you would like a SLURM version of this pipeline, or a version that can run on a standard Unix/Ubuntu machine, please navigate to the bottom of this README.
Download the pipeline, databases, associated scripts, prerequisites and other tools. Run the pipeline:
./FindFungi-0.23.3.sh /path/to/FASTQ-file.fastq Dataset-name
These instructions should help you get a copy of FindFungi up and running on your own compute cluster or server, for development or for your own analyses. If you are using a compute cluster other than IBM LSF, change the 'bsub' commands to match your scheduler. If you are using a single server, remove the 'bsub' commands.
FindFungi v0.23 was built using the following:
for f in Kraken_*.tar.gz; do tar -xvzf "$f"; done
mv Kraken_* Kraken_DB_Directory/
Download the dataset ERR675624 from the European Nucleotide Archive database. This dataset contains fungal reads.
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR675/ERR675624/ERR675624_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR675/ERR675624/ERR675624_2.fastq.gz
Gunzip these files and concatenate them, as we have no need for read-pair information.
gunzip ERR675624_*.fastq.gz
cat ERR675624_*.fastq > ERR675624_both-pairs.fastq
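A quick sanity check after concatenation is that the combined file holds as many reads as the two inputs together. The sketch below assumes standard four-line FASTQ records (no wrapped sequence lines). The helper count_reads is hypothetical and not part of FindFungi.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count the reads in a FASTQ file, assuming each
# record occupies exactly four lines.
count_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# Compare the concatenated file against its two inputs, if they are present.
if [ -f ERR675624_both-pairs.fastq ] && [ -f ERR675624_1.fastq ] && [ -f ERR675624_2.fastq ]; then
    total=$(( $(count_reads ERR675624_1.fastq) + $(count_reads ERR675624_2.fastq) ))
    combined=$(count_reads ERR675624_both-pairs.fastq)
    echo "inputs: $total reads, combined: $combined reads"
fi
```

If the two numbers differ, the concatenation was probably interrupted or picked up an extra file.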
Execute the FindFungi pipeline on this FASTQ file. We use nohup here so that the pipeline keeps running if the terminal session ends; append '&' to the command to run it in the background.
nohup ./FindFungi-0.23.sh /path/to/ERR675624_both-pairs.fastq ERR675624
The first command line argument (/path/to/ERR675624_both-pairs.fastq) points to your FASTQ file. The second (ERR675624) is the name FindFungi will use for this dataset. This name should be informative.
The .csv results should show the following:
#Taxon name,Taxid,Reads mapping to taxid,Reads mapping to children taxids,Pearson skewness score,Percent of pseudo-chromosomes with read hits
Candida sp. LDI48194,1759314,671,0,0.524623062587,100.0
Malassezia restricta,76775,378,0,0.496034792692,100.0
Candida tropicalis MYA-3404,294747,265,0,-0.265788716977,100.0
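Because the results file is plain CSV, it can be screened with standard tools. As a minimal sketch, the awk filter below keeps rows whose absolute skewness score is at or below a chosen maximum and whose percentage of pseudo-chromosomes with hits is at or above a chosen minimum. The function name filter_hits and the cut-offs used in the example (0.5 and 60) are illustrative assumptions, not values taken from FindFungi. Column positions follow the header shown above, and taxon names are assumed not to contain commas.

```shell
#!/usr/bin/env bash
# Hypothetical helper: filter a FindFungi-style results CSV read from stdin.
#   $1 = maximum absolute skewness score (column 5)
#   $2 = minimum percent of pseudo-chromosomes with read hits (column 6)
# Header lines starting with '#' are passed through unchanged.
filter_hits() {
    awk -F',' -v max_skew="$1" -v min_pct="$2" '
        /^#/ { print; next }
        {
            skew = $5 + 0
            if (skew < 0) skew = -skew
            if (skew <= max_skew + 0 && $6 + 0 >= min_pct + 0) print
        }'
}

# Example using the three rows shown above and arbitrary cut-offs.
filter_hits 0.5 60 <<'EOF'
#Taxon name,Taxid,Reads mapping to taxid,Reads mapping to children taxids,Pearson skewness score,Percent of pseudo-chromosomes with read hits
Candida sp. LDI48194,1759314,671,0,0.524623062587,100.0
Malassezia restricta,76775,378,0,0.496034792692,100.0
Candida tropicalis MYA-3404,294747,265,0,-0.265788716977,100.0
EOF
```

With these arbitrary cut-offs, the header and the Malassezia restricta and Candida tropicalis MYA-3404 rows are printed, while the Candida sp. LDI48194 row is dropped because its absolute skewness exceeds 0.5.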
Dr Ali Snedden has kindly created a SLURM implementation of the pipeline: https://github.com/astrophys/FindFungi_adapted_for_slurm.
Please use this version if you intend to use FindFungi on a single Unix machine: https://github.com/GiantSpaceRobot/FindFungi_SingleServerVersion
This project is licensed under the MIT License.