beaplab / transcriptome_metaT_quantification

Small pipeline to quantify transcriptomes from a metaT

Transcriptome vs metatranscriptomes pipeline

Objective

An easy-to-use pipeline to quickly quantify a set of transcriptomes over many metatranscriptome samples.

On our nisaba system, I have already computed the sourmash signatures for all the metatranscriptomes I have downloaded.
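
For reference, a signature like the ones stored on nisaba can be built with sourmash sketch. The k-mer size and scaled value below are illustrative assumptions, not necessarily the parameters used for the precomputed signatures:

# sketch a metatranscriptome into a sourmash signature (illustrative parameters)
sourmash sketch dna -p k=31,scaled=1000 sample_reads.fastq.gz -o sample_reads.zip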

In the metatranscriptomes dataset all info Google Sheet you can find the information on all the datasets present in nisaba so far. Take a look at it and decide which ones you would like to analyze; the Relevance column will help you choose!

How to

Preparing the stage

For all these analyses we will need a working folder containing the transcriptomes of interest and a sample sheet listing the metatranscriptomes to quantify.

In my example, I will quantify Florenciella.

mkdir florenciella_biogeography
cd florenciella_biogeography

# a folder for the data we will use
mkdir data 

# a folder for some of the scripts 
mkdir scripts

Once inside, to avoid copying and pasting outputs from previous processes, we can create a symbolic link to the transcriptomes of interest:

ln -s <path-to-your-dir-transcriptomes> data/transcriptomes     
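
For example, if your transcriptomes were stored in a hypothetical folder such as /home/<name_user>/transcriptomes_of_interest, the call and a quick check would look like:

# link the transcriptome folder (hypothetical path, replace with your own)
ln -s /home/<name_user>/transcriptomes_of_interest data/transcriptomes

# confirm the link resolves and the files are visible
ls -lh data/transcriptomes/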

Now we need to select the samples we are interested in quantifying.

Sample sheet creation

In nisaba there is a CSV sheet with the paths to all the files, with this structure:

group name fastq_r1 fastq_r2 single_end sig
2012_carradec_tara 004_0o8-5_DCM /scratch/datasets_symbolic_links/metatranscriptomes/2012_carradec_tara/004_0o8-5_DCM.fasta.gz NA TRUE /scratch/datasets_symbolic_links/metaT_signatures/2012_carradec_tara/004_0o8-5_DCM.zip

It gathers all the information in one place, so there is no need to keep individual copies for each dataset. To obtain a subset of it, you can use a script I have created, named scripts/dataset_selector.R.

We need to download it to the folder we created previously:

# no harm if the folder already exists from before
mkdir -p scripts

wget https://raw.githubusercontent.com/beaplab/transcriptome_metaT_quantification/main/scripts/dataset_selector.R -O scripts/dataset_selector.R

We will use its output to quantify the desired samples. Beforehand, choose which datasets you want to quantify and note their nicknames from the highlighted column. You can run the script with the following structure:

Rscript scripts/dataset_selector.R 2012_carradec_tara,2021_tara_polar 

In this case we are focusing on Tara and Tara Polar, but you may be interested in working with something else; it depends entirely on your species of interest. Check the Relevance column to decide. You can also run everything; it will take longer, but that's fine.
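
If you want to double-check the available nicknames without opening the Google Sheet, you can list the group column of the master CSV on nisaba (this assumes group is the first comma-separated column, as in the example above):

# list the dataset nicknames present in the master sheet
cut -d, -f1 /scratch/datasets_symbolic_links/dataset_sheets/metatranscriptomes_datasets.csv | sort -u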

The script will generate data/sample_sheet/<date>_dataset-selection.csv, which will be the input of our pipeline.
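
A quick sanity check that the subset was written and contains what you expect:

# list the generated sheet and count its rows
ls data/sample_sheet/
wc -l data/sample_sheet/*_dataset-selection.csv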

In case you want to run your transcriptome against everything available, you can copy the CSV directly from its location to your folder:

cp /scratch/datasets_symbolic_links/dataset_sheets/metatranscriptomes_datasets.csv data/sample_sheet/<date>_all-samples.csv
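
For example, assuming you are happy with a YYYYMMDD stamp for the <date> placeholder (the format is up to you), this creates the folder and copies the full sheet in one go:

# make sure the destination folder exists, then copy with today's date in the name
mkdir -p data/sample_sheet
cp /scratch/datasets_symbolic_links/dataset_sheets/metatranscriptomes_datasets.csv \
    data/sample_sheet/$(date +%Y%m%d)_all-samples.csv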

Running nextflow quantification

With all this information you will be able to run the whole pipeline.

First we will start a screen session, so the call runs in the background and we can check the progress later.

screen -R quantifying_<name_user>

This call will open a new session (you will have to press Enter after running it). The -R flag is for reconnecting, but since there won't be any session with this name yet, it will create a new one. If we want to get out and continue with our lives, we press Ctrl + A and then Ctrl + D. These keystrokes are how screen handles its different functions: Ctrl + A puts you in 'do things at the screen level' mode, and Ctrl + D is the 'detach me from screen' command.
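
As a quick reference, these standard screen commands cover the rest of the workflow:

# list existing sessions
screen -ls

# reattach to the session later (same call as before; -R reattaches if it already exists)
screen -R quantifying_<name_user>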

We will test that everything is working correctly using a test sample sheet with only 3 samples.

We can download it in a similar fashion as before:

wget https://raw.githubusercontent.com/beaplab/transcriptome_metaT_quantification/main/data/test_data/sample_sheet/dataset_correspondence_paths_test.csv -O data/sample_sheet/test.csv 

We will also download a transcriptome into data/transcriptomes, in case you are following this example without one of your own (create the folder first if you skipped the symbolic link step):

mkdir -p data/transcriptomes
wget https://github.com/beaplab/transcriptome_metaT_quantification/raw/main/data/genomic_data/transcriptomes/nucleotide_version/EP00618_Florenciella_parvula.fna.gz -O data/transcriptomes/Florenciella_parvula.fna.gz
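
A quick check that the file landed where the pipeline expects it and looks like a FASTA:

# confirm the file is in place and peek at the first records
ls -lh data/transcriptomes/
zcat data/transcriptomes/Florenciella_parvula.fna.gz | head -n 4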

Let's test the quantification then:

nextflow run beaplab/transcriptome_metaT_quantification \
    --fastq_sheet data/sample_sheet/test.csv \
    --transcriptome "data/transcriptomes/*.fna.gz" \
    --outdir data/test_quantification -r main
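
Once the test finishes, you can confirm the run completed and that output was produced; nextflow log lists previous runs and their status, and the output folder is whatever you passed to --outdir:

# list previous runs and their status
nextflow log

# inspect the test output folder
ls data/test_quantification/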

Behind the curtain, Nextflow takes each sample in the sheet and quantifies your transcriptome against it, launching the different steps as separate processes.

If everything worked well, we are good to go, and we can run the program. In my case it would be the following:

nextflow run beaplab/transcriptome_metaT_quantification \
    --fastq_sheet data/sample_sheet/<sample sheet>.csv \
    --transcriptome "data/transcriptomes/*.fna.gz" \
    --outdir data/quantification -r main

Nextflow should start the different processes automatically.

If for some reason we have to stop the program, we can continue from where it left off by adding the -resume flag. This is one of the perks of nextflow; the other is that it computes everything pretty fast :)
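
For example, the same call as above with -resume appended will reuse everything already computed:

nextflow run beaplab/transcriptome_metaT_quantification \
    --fastq_sheet data/sample_sheet/<sample sheet>.csv \
    --transcriptome "data/transcriptomes/*.fna.gz" \
    --outdir data/quantification -r main -resume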

And then we will continue with the analysis.

Remember to get out of screen! Ctrl + A and then Ctrl + D.

After a while, we can reconnect with screen -R quantifying_<name_user> to check how everything is going.

PS: In nisaba you can find this example folder with everything in place, to better understand how to structure things:

/home/aauladell/small_works/small_examples/florenciella_quantification_example

Adrià or someone else has updated the workflow: what now

OK, so with nextflow it is quite easy to get new versions.

By typing:

nextflow pull beaplab/transcriptome_metaT_quantification

This will download the new version and prepare it to run.
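
To check which revision you now have locally, nextflow info prints the project details and the available revisions:

# show the local copy of the pipeline and its revisions
nextflow info beaplab/transcriptome_metaT_quantification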