RobertsLab / resources

https://robertslab.github.io/resources/

Download coral SRA data (RNAseq and WGBS) #1569

Closed sr320 closed 1 year ago

sr320 commented 1 year ago

With an eye toward running through CEABiGR pipelines

64 samples total - PRJNA744403

https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA744403

kubu4 commented 1 year ago

These are all downloaded and converted to FastQ. I'm in the process of gzip-ing them (which is going to take a while).

After that completes, I was going to run these through QC, too. Or should I skip that part?

Also, do you have any preference on where they're stored (i.e. should I put them in Nightingales)?

sr320 commented 1 year ago

QC sounds good - I would put them on Gannet - since this isn't our own data, there's no requirement to use Nightingales.

kubu4 commented 1 year ago

Notebook is here: https://robertslab.github.io/sams-notebook/2023/01/13/SRA-Data-Coral-SRA-BioProject-PRJNA744403-Download-and-QC.html

Data is stored in subdirectories, by species and library type here: https://gannet.fish.washington.edu/Atumefaciens/20230119-coral-fastqc-fastp-multiqc-PRJNA744403/data/

Directory tree file (text) might help navigate: https://gannet.fish.washington.edu/Atumefaciens/20230119-coral-fastqc-fastp-multiqc-PRJNA744403/directory-tree.txt

AHuffmyer commented 1 year ago

@kubu4 I am trying to download these RNAseq SRA files to our URI server to do some analysis. I looked at your notebook, but I'm a bit confused about how to download files from SRA to the remote server.

sr320 commented 1 year ago

@AHuffmyer I have a notebook too :)

https://sr320.github.io/Apulcra/

/home/shared/sratoolkit.2.11.2-ubuntu64/bin/fasterq-dump.2.11.2 \
--outdir /home/sr320/ncbi  \
--split-files \
--threads 27 \
--mem 100GB \
--progress \
SRR8601366
kubu4 commented 1 year ago

@AHuffmyer - Can you clarify which part you're confused by?

Step 2 in my notebook post shows the command:

/gscratch/srlab/programs/sratoolkit.2.11.3-centos_linux64/bin/prefetch \
 --option-file /gscratch/srlab/sam/data/NCBI-BioProject-PRJNA744403-coral_metagenomics/SraAccList-PRJNA744403.txt
AHuffmyer commented 1 year ago

Thank you both! @sr320's script works for me!

AHuffmyer commented 1 year ago

Actually, I now have a follow up question. @sr320 for your script, did you find a way to download many files at once reading in names from a text file?

@kubu4 I used your approach and it resulted in a folder for each SRR#, and inside each folder is a .sra file that I then need to split/convert into FastQ files using the loop in your notebook. However, it looks like your script doesn't have a recursive setting to look for the .sra files within the directories that are created during the download. Did you have this problem?

sr320 commented 1 year ago

@AHuffmyer Technically, Sam is doing it correctly by using prefetch first, then converting. Can you provide a URL to what you are trying to do?

sr320 commented 1 year ago

Also, fasterq-dump can take a list of accessions.

kubu4 commented 1 year ago

The method @sr320 uses allows you to supply a list of accessions. Not sure how that list is delimited, but the help menu says you can do that.

sr320 commented 1 year ago

Something like this should work if you want to skip prefetch:

for i in $(cat ../data/acc_list01.txt); do
  echo ${i}
  date
  /home/shared/sratoolkit.2.11.2-ubuntu64/bin/fasterq-dump.2.11.2 \
    --outdir /home/sr320/ncbi \
    --split-files \
    --threads 48 \
    --mem 100GB \
    --progress \
    ${i}
done

where ../data/acc_list01.txt looks like

SRR8601367 
SRR8601368 
SRR8601369

running now to confirm
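As a quick sanity check of the loop pattern itself (no SRA Toolkit needed; file names here are made up for the demo), the converter can be swapped for a logging stand-in. A `while read` loop also avoids the word-splitting pitfalls of `for i in $(cat ...)`:

```shell
#!/bin/bash
# Mock check of the list-driven loop: the accession list is newline-delimited,
# and the loop should invoke the converter once per accession.
rm -f commands.log
printf 'SRR8601367\nSRR8601368\nSRR8601369\n' > acc_list01.txt

# Stand-in for fasterq-dump; logs the command it would have run.
while read -r acc; do
  echo "fasterq-dump --split-files ${acc}" >> commands.log
done < acc_list01.txt

cat commands.log
```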

AHuffmyer commented 1 year ago

Thank you both, this is so helpful! I am trying to download the RNAseq data for the E5 Deep Dive samples from this project to URI's server: https://www.ncbi.nlm.nih.gov/sra?linkname=bioproject_sra_all&from_uid=744403. I also need to learn how to do remote downloads from NCBI anyway. @sr320 I will try your code this morning!

kubu4 commented 1 year ago

@AHuffmyer - Can you show us the code you've used and plan on using?

Also, @sr320's code could be simplified by doing something like this:

/home/shared/sratoolkit.2.11.2-ubuntu64/bin/fasterq-dump.2.11.2 \
--outdir /home/sr320/ncbi  \
--split-files \
--threads 27 \
--mem 100GB \
--progress \
SRR8601366 \
SRR8601367 \
SRR8601369
sr320 commented 1 year ago

@kubu4 Actually, that does not seem to work... it was the first thing I tried.

kubu4 commented 1 year ago

Oh! Interesting! Help menu suggests that should work. Sorry!

kubu4 commented 1 year ago

Ah! When using sratoolkit.2.11.2-ubuntu64 with a space-delimited list of accessions (like in my comment above), the following error is generated:

fasterq-dump.2.11.2 int: string unexpected while executing query within virtual file system module - multiple response SRR URLs for the same service 's3'

Turns out, if we use an upgraded version, the command works! A quick test on Raven seems to have worked (or, at least, didn't throw an error):

/home/shared/sratoolkit.3.0.2-ubuntu64/bin/fasterq-dump \
--outdir /home/shared/8TB_HDD_02/sam/ \
--progress \
SRR8601366 \
SRR8601367
AHuffmyer commented 1 year ago

This is the script I am running now:

#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=ashuffmyer@uri.edu #your email to send notifications
#SBATCH --account=putnamlab                  
#SBATCH --error="download_sra_error" #if your job fails, the error report will be put in this file
#SBATCH --output="download_sra_output" #once your job is completed, any final job report comments will be put in this file

module load SRA-Toolkit/2.10.9-gompi-2020b

prefetch --option-file ../raw/SraAccList.txt -O ../raw #this creates a folder for each SRR in the .txt list and outputs in the raw data folder  

for file in *.sra; do fasterq-dump "${file}" --split-files -O ../raw; done #need to run this to convert to fastq 

Right now I am running only the prefetch step, and that is working well. I will next run the fasterq-dump step, but I anticipate a problem with the nested directories. I have attached a screenshot of the directories that end up in my folder. I need to figure out how to run the fasterq-dump step recursively. Maybe there is a way to code the loop to look in each folder for the .sra file with the same name as in the SraAccList.txt?

[Screenshot 2023-03-15 at 8:31:25 AM]
kubu4 commented 1 year ago

> Maybe there is a way to code the loop to look in each folder for the .sra file with the same name as in the SraAccList.txt?

I think this depends on how SRA Tool Kit was configured. If it's configured with a designated cache directory and the "Tools" menu indicates to use that cache directory, then the fasterq-dump command automatically handles recursion into directories.

This is why I was able to run the fasterq-dump command without anything "extra".

See this SRA Tool Kit page for more info:

https://github.com/ncbi/sra-tools/wiki/05.-Toolkit-Configuration
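For reference, the cache location can also be set non-interactively. This is only a sketch under assumptions not confirmed in the thread: it assumes vdb-config's --set flag and the public repository root key used by older toolkit versions - verify both with `vdb-config -i` on your system before relying on it.

```
# Sketch only (unverified key path): set the SRA cache root without the
# interactive menu. The path on the right is a hypothetical example.
vdb-config --set /repository/user/main/public/root=/some/cache/directory
```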

AHuffmyer commented 1 year ago

Oh perfect! I'll run that next and let you know how it works.

kubu4 commented 1 year ago

> Maybe there is a way to code the loop to look in each folder for the .sra file with the same name as in the SraAccList.txt?

Just tested - when set to use the cache location, only SRA files are downloaded; no directory structure gets generated. Thus, the for file in *.sra... approach works, because there aren't any directories to descend into.

AHuffmyer commented 1 year ago

That makes sense. I will also try something like fasterq-dump --outdir ../raw --split-files SRR* to see if it can handle the directories as well.

kubu4 commented 1 year ago

Also, if you can't (or don't want to) mess around with the SRA Tool Kit configuration stuff, you can recursively parse through directories for specific files like so:


# Enable recursive globbing
shopt -s globstar

# Run fasterq-dump on any SRA file
for file in **/*.sra
do
  fasterq-dump "${file}"
done
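The globstar behavior can be verified with empty mock files, no SRA Toolkit required (directory and file names below are made up for the demo):

```shell
#!/bin/bash
# Demonstrate that **/*.sra matches .sra files nested inside subdirectories,
# which is how the loop above finds each prefetch-created accession folder.
shopt -s globstar
mkdir -p demo/SRR0000001 demo/SRR0000002
touch demo/SRR0000001/SRR0000001.sra demo/SRR0000002/SRR0000002.sra

matches=()
for file in demo/**/*.sra; do
  matches+=("$file")
done
echo "${#matches[@]} files matched"   # prints: 2 files matched
```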
AHuffmyer commented 1 year ago

Woo hoo, this works! Here is the entire script.

#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=ashuffmyer@uri.edu #your email to send notifications
#SBATCH --account=putnamlab                  
#SBATCH --error="download_sra_error" #if your job fails, the error report will be put in this file
#SBATCH --output="download_sra_output" #once your job is completed, any final job report comments will be put in this file

module load SRA-Toolkit/2.10.9-gompi-2020b

prefetch --option-file ../raw/SraAccList.txt -O ../raw #this creates a folder for each SRR in the .txt list and outputs in the raw data folder  

shopt -s globstar #Enable recursive globbing

# Run fasterq-dump on any SRA file in any directory and split into read 1 and 2 files and put in raw folder 
for file in ../raw/**/*.sra
do
  fasterq-dump --outdir ../raw --split-files "${file}"
done

#Remove the SRR directories that are no longer needed
rm -r ../raw/SRR*/
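One possible hardening of the script above (not from the thread, just a suggestion): delete each accession directory only after its own conversion succeeds, so a failed fasterq-dump run doesn't discard the .sra file. A sketch with a mock converter standing in for fasterq-dump, using hypothetical file names:

```shell
#!/bin/bash
# mock_convert stands in for fasterq-dump; a real run would call the toolkit
# with --outdir/--split-files as in the script above.
mock_convert() { touch "raw/$(basename "$1" .sra)_1.fastq"; }

mkdir -p raw/SRR0000001
touch raw/SRR0000001/SRR0000001.sra

shopt -s globstar
for file in raw/**/*.sra; do
  if mock_convert "$file"; then
    rm -r "$(dirname "$file")"   # remove only this accession's directory
  fi
done
```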

[Screenshot 2023-03-15 at 9:53:45 AM]

AHuffmyer commented 1 year ago

My notebook post on this topic is here: https://ahuffmyer.github.io/ASH_Putnam_Lab_Notebook/E5-Deep-Dive-RNAseq-Count-Matrix-Analysis/