sr320 closed this issue 1 year ago
These are all downloaded and converted to FastQ. I'm in the process of gzip-ing them (which is going to take a while).
After that completes, I was going to run these through QC, too. Or should I skip that part?
Also, do you have any preference on where they're stored (i.e. should I put them in Nightingales)?
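(Side note on the gzip step: compressing the FastQ files in parallel with find + xargs can cut that wait considerably. A minimal sketch below - the demo directory, file name, and thread count are illustrative, not from the actual run.)

```shell
#!/bin/bash
set -euo pipefail

# Demo directory with a sample FastQ file (stand-in for the real download dir).
FASTQ_DIR="./fastq_demo"
mkdir -p "${FASTQ_DIR}"
printf '@read1\nACGT\n+\nIIII\n' > "${FASTQ_DIR}/SRR0000001_1.fastq"

# Compress every .fastq in parallel: -print0/-0 survive odd filenames,
# -P 4 runs four gzip processes at once, -r skips the run if nothing matched.
find "${FASTQ_DIR}" -name "*.fastq" -print0 \
  | xargs -0 -r -P 4 -n 1 gzip
```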
QC sounds good - I would put it on Gannet - since it's not our own data, there's no requirement to put it in Nightingales.
Notebook is here: https://robertslab.github.io/sams-notebook/2023/01/13/SRA-Data-Coral-SRA-BioProject-PRJNA744403-Download-and-QC.html
Data is stored in subdirectories, by species and library type here: https://gannet.fish.washington.edu/Atumefaciens/20230119-coral-fastqc-fastp-multiqc-PRJNA744403/data/
Directory tree file (text) might help navigate: https://gannet.fish.washington.edu/Atumefaciens/20230119-coral-fastqc-fastp-multiqc-PRJNA744403/directory-tree.txt
@kubu4 I am trying to download these RNAseq SRA files to our URI server to do some analysis. I looked at your notebook but I'm a bit confused about how to download files from SRA to the remote server.
@AHuffmyer I have a notebook too :)
https://sr320.github.io/Apulcra/
/home/shared/sratoolkit.2.11.2-ubuntu64/bin/fasterq-dump.2.11.2 \
--outdir /home/sr320/ncbi \
--split-files \
--threads 27 \
--mem 100GB \
--progress \
SRR8601366
@AHuffmyer - Can you clarify which part you're confused by?
Step 2 in my notebook post shows the command:
/gscratch/srlab/programs/sratoolkit.2.11.3-centos_linux64/bin/prefetch \
--option-file /gscratch/srlab/sam/data/NCBI-BioProject-PRJNA744403-coral_metagenomics/SraAccList-PRJNA744403.txt
Thank you both! @sr320's script works for me!
Actually, I now have a follow up question. @sr320 for your script, did you find a way to download many files at once reading in names from a text file?
@kubu4 I used your approach and it resulted in a folder for each SRR#, and inside of that is a .sra file, which I then need to split/convert into FastQ files using the loop in your notebook. However, it looks like your script doesn't have a recursive setting to look for the .sra files within the directories that are created during the download. Did you have this problem?
@AHuffmyer Technically, Sam is doing it correctly by using prefetch first, then converting. Can you provide a URL to what you are trying to do?
Also, fasterq-dump can take a list of accessions.
The method @sr320 uses allows you to supply a list of accessions. Not sure how that list is delimited, but the help menu says you can do that.
something like this should work if you want to skip prefetch
for i in $(cat ../data/acc_list01.txt); do echo $i; date;
/home/shared/sratoolkit.2.11.2-ubuntu64/bin/fasterq-dump.2.11.2 \
--outdir /home/sr320/ncbi \
--split-files \
--threads 48 \
--mem 100GB \
--progress \
$i; done
where ../data/acc_list01.txt
looks like
SRR8601367
SRR8601368
SRR8601369
running now to confirm
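For anyone copying that loop later: a slightly more defensive variant reads the accession list with a while/read loop, which also tolerates blank lines and Windows-style carriage returns that sometimes show up in NCBI-exported lists. This is a dry-run sketch (it echoes the commands rather than invoking fasterq-dump); the demo list file and output dir are placeholders, not the thread's actual paths.

```shell
#!/bin/bash
set -euo pipefail

# Demo accession list (stand-in for ../data/acc_list01.txt).
ACC_LIST="./acc_list_demo.txt"
printf 'SRR8601367\nSRR8601368\nSRR8601369\n' > "${ACC_LIST}"

# Read one accession per line; tr -d strips any stray carriage returns.
while IFS= read -r acc; do
  acc="$(printf '%s' "${acc}" | tr -d '\r')"
  [ -z "${acc}" ] && continue
  # Dry run: echo the command instead of invoking fasterq-dump.
  echo fasterq-dump --outdir ./raw --split-files --progress "${acc}"
done < "${ACC_LIST}" > commands.txt
```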
Thank you both, this is so helpful! I am trying to download the E5 Deep Dive samples from this project: https://www.ncbi.nlm.nih.gov/sra?linkname=bioproject_sra_all&from_uid=744403 for RNAseq to URI's server. And I also need to learn how to download remotely from NCBI anyway. @sr320 I will try your code this morning!
@AHuffmyer - Can you show us the code you've used and plan on using?
Also, @sr320's code could be simplified by doing something like this:
/home/shared/sratoolkit.2.11.2-ubuntu64/bin/fasterq-dump.2.11.2 \
--outdir /home/sr320/ncbi \
--split-files \
--threads 27 \
--mem 100GB \
--progress \
SRR8601366 \
SRR8601367 \
SRR8601369
@kubu4 Actually, that does not seem to work... it was the first thing I tried.
Oh! Interesting! Help menu suggests that should work. Sorry!
Ah! When using sratoolkit.2.11.2-ubuntu64 with a space-delimited list of accessions (like in my comment above), the following error is generated:
fasterq-dump.2.11.2 int: string unexpected while executing query within virtual file system module - multiple response SRR URLs for the same service 's3'
Turns out, if we use an upgraded version, the command works! Quick test on Raven seems to have worked (or, at least, didn't throw an error):
/home/shared/sratoolkit.3.0.2-ubuntu64/bin/fasterq-dump \
--outdir /home/shared/8TB_HDD_02/sam/ \
--progress \
SRR8601366 \
SRR8601367
This is the script I am running now:
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=ashuffmyer@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH --error="download_sra_error" #if your job fails, the error report will be put in this file
#SBATCH --output="download_sra_output" #once your job is completed, any final job report comments will be put in this file
module load SRA-Toolkit/2.10.9-gompi-2020b
prefetch --option-file ../raw/SraAccList.txt -O ../raw #this creates a folder for each SRR in the .txt list and outputs in the raw data folder
for file in *.sra; do fasterq-dump "${file}" --split-files -O ../raw; done #need to run this to convert to fastq
Right now I am running only the prefetch step and that is working well. I will next run the fasterq-dump step, but I anticipate having a problem with the recursive nature of the directories. I have attached a screenshot of the directories that end up in my folder. I need to figure out how to run the fasterq-dump step recursively. Maybe there is a way to code the loop to look in each folder for the .sra file with the same name from the SraAccList.txt?
> Maybe there is a way to code the loop to look in the folder and .sra file with the same name from the SraAccList.txt?
I think this depends on how the SRA Toolkit was configured. If it's configured with a designated cache directory, and the "Tools" menu indicates to use that cache directory, then the fasterq-dump command automatically handles recursion into directories. This is why I was able to run the fasterq-dump command without anything "extra".
See this SRA Tool Kit page for more info:
https://github.com/ncbi/sra-tools/wiki/05.-Toolkit-Configuration
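(For reference, and untested here: the cache location can also be set from the command line rather than the interactive menu. vdb-config's --set flag is documented on that wiki page; the cache path below is a placeholder, not a real location on any of our systems.)

```shell
# Point the toolkit's public cache at a directory of your choosing
# (placeholder path - substitute your own), then verify interactively.
vdb-config --set /repository/user/main/public/root=/path/to/sra-cache
vdb-config --interactive
```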
Oh perfect! I'll run that next and let you know how it works.
> Maybe there is a way to code the loop to look in the folder and .sra file with the same name from the SraAccList.txt?
Just tested - when set to use the cache location, only SRA files are downloaded; no directory structure gets generated. Thus, the for file in *.sra... loop works, because there aren't any directories to descend into.
That makes sense. I will also try something like this: fasterq-dump --outdir ../raw --split-files SRR* to see if it could handle directories as well.
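Another option, if you'd rather not touch the toolkit configuration or shell options at all: let find descend into the per-accession directories for you. A dry-run sketch (it echoes the commands instead of invoking fasterq-dump, and the demo directory layout stands in for what prefetch actually creates):

```shell
#!/bin/bash
set -euo pipefail

# Demo layout mimicking prefetch output: one directory per accession,
# each holding a .sra file (empty placeholders, not real SRA data).
mkdir -p raw_demo/SRR8601367 raw_demo/SRR8601368
touch raw_demo/SRR8601367/SRR8601367.sra raw_demo/SRR8601368/SRR8601368.sra

# find handles the directory descent; no globstar needed.
# Dry run: echo the command instead of invoking fasterq-dump.
find raw_demo -name "*.sra" | sort | while IFS= read -r sra; do
  echo fasterq-dump --outdir raw_demo --split-files "${sra}"
done > fq_commands.txt
```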
Also, if you can't (or don't want to mess around with the SRA Tool Kit configuration stuff), you can recursively parse through directories for specific files like so:
# Enable recursive globbing
shopt -s globstar
# Run fasterq-dump on any SRA file
for file in **/*.sra
do
fasterq-dump "${file}"
done
Woo hoo, this works! Here is the entire script.
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=ashuffmyer@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH --error="download_sra_error" #if your job fails, the error report will be put in this file
#SBATCH --output="download_sra_output" #once your job is completed, any final job report comments will be put in this file
module load SRA-Toolkit/2.10.9-gompi-2020b
prefetch --option-file ../raw/SraAccList.txt -O ../raw #this creates a folder for each SRR in the .txt list and outputs in the raw data folder
shopt -s globstar #Enable recursive globbing
# Run fasterq-dump on any SRA file in any directory and split into read 1 and 2 files and put in raw folder
for file in ../raw/**/*.sra
do
fasterq-dump --outdir ../raw --split-files "${file}"
done
#Remove the SRR directories that are no longer needed
rm -r ../raw/SRR*/
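One addition that may be worth tacking onto the end of a script like this, before the rm -r step deletes the SRR directories: a sanity check that every accession in SraAccList.txt actually produced FastQ output. The sketch below is a self-contained demo (the directory, list file, and touched files are stand-ins), and it assumes fasterq-dump's default accession_1.fastq naming for split read-1 files.

```shell
#!/bin/bash
set -euo pipefail

# Demo inputs (stand-ins for ../raw and SraAccList.txt).
RAW="./raw_check_demo"
mkdir -p "${RAW}"
printf 'SRR8601367\nSRR8601368\n' > "${RAW}/SraAccList.txt"
# Pretend only the first accession was converted successfully.
touch "${RAW}/SRR8601367_1.fastq" "${RAW}/SRR8601367_2.fastq"

# Flag any accession that did not yield a read-1 FastQ file.
while IFS= read -r acc; do
  [ -z "${acc}" ] && continue
  if [ ! -f "${RAW}/${acc}_1.fastq" ]; then
    echo "MISSING: ${acc}"
  fi
done < "${RAW}/SraAccList.txt" > "${RAW}/missing.txt"
```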
My notebook post on this topic is here: https://ahuffmyer.github.io/ASH_Putnam_Lab_Notebook/E5-Deep-Dive-RNAseq-Count-Matrix-Analysis/
With an eye toward running through CEABiGR pipelines
64 samples total - PRJNA744403
https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA744403