erhard-lab / price

Improved Ribo-seq enables identification of cryptic translation events

Processing multiple samples vs. replicates #5

Closed TamaraO closed 6 years ago

TamaraO commented 6 years ago

If I have several replicates for a sample and would like to combine the reads, do I list them within the .json file in the datasets section?

Also, if I have multiple distinct samples (not replicates), do I need to create a separate pipeline for each sample with a separate json file?

I'm not sure whether the serial vs. parallel execution modes have anything to do with this?

florianerhard commented 6 years ago

Dear Tamara,

I started to create a FAQ in the wiki, I hope this clarifies your questions:

I have several samples (replicates, conditions); how do I use the pipeline?

The pipeline works as follows:

  1. Read mapping: Reads from each sample (specified in the datasets property of the json) are mapped individually against the specified references (i.e., there will be one mapping file per datasets entry).
  2. Merging: The mapping files are merged into a single file, preserving the information about which read mapping came from which sample (i.e., this is not pooling the samples).
  3. Reporting: When report.sh is used, several statistics are written into the report directory along with index.html, which can be loaded in any web browser to inspect the reports.
  4. PRICE: PRICE reads the merged mappings and computes its codon model. Either a single model is estimated for all samples, or, when using the -percond parameter (the default when using the pipeline), a separate model for each sample.
  5. PRICE: Using the codon model(s), reads are mapped to P-site codons, which are assembled into ORFs and scored using the generalized binomial test. The .orfs.tsv file contains all ORFs (unfiltered) along with the number of reads for each sample.

Thus, if you have several conditions with several replicates each, you should create a single json file and process them together. To get the total number of reads for each ORF in each of the samples, look at the .orfs.tsv file.
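For illustration, a single json covering two conditions with two replicates each might look like the sketch below. This is a hypothetical example; the exact property names (e.g. fastq) are assumptions here, so please check the wiki for the actual schema expected by gedi -e Pipeline.

```json
{
  "datasets": [
    { "name": "cond1_rep1", "fastq": "cond1_rep1.fastq.gz" },
    { "name": "cond1_rep2", "fastq": "cond1_rep2.fastq.gz" },
    { "name": "cond2_rep1", "fastq": "cond2_rep1.fastq.gz" },
    { "name": "cond2_rep2", "fastq": "cond2_rep2.fastq.gz" }
  ]
}
```

Each datasets entry yields its own mapping file and its own column in the final .orfs.tsv, so replicates stay distinguishable throughout.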

What are the modes of execution (-r) of the pipeline?

The pipeline consists of several jobs (e.g., one mapping job per sample, a merge job, a report job, a PRICE job). The mode of execution determines how these jobs are executed. Currently, there are three modes of execution: serial, parallel and cluster. In serial execution, all jobs are executed one after another. In parallel execution, jobs are executed in parallel where possible (by starting them as background processes and then using bash's wait command). In cluster mode, a grid management system (such as SLURM) is used. Importantly, in all cases the pipeline takes care of dependencies (e.g., merging the mappings can only be started once all mapping jobs are finished).
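The parallel mode can be sketched as follows. This is a minimal illustration of the background-process-plus-wait pattern, not the actual generated start.bash; the map_sample function is a stand-in for a real mapping job.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of "parallel" execution: independent per-sample
# mapping jobs run as background processes; `wait` enforces the
# dependency so merging only starts once all mapping jobs are done.

map_sample() {                 # stand-in for a real mapping job
  echo "mapped $1" > "$1.done"
}

for s in A0101 A3303 B1501 B4402; do
  map_sample "$s" &            # launch each mapping job in the background
done
wait                           # block until ALL background jobs finish

echo "merging $(ls ./*.done | wc -l) mapping files"   # dependency satisfied
```

In cluster mode the same dependency would instead be expressed through the grid engine's job-dependency mechanism rather than wait.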

Florian

TamaraO commented 6 years ago

Hi Florian,

I have tried running gedi Pipeline on 4 replicates of the same sample, and it seemed to start off well but never finished. I saw it generating SAM files (genomic and transcriptomic) in a temporary folder, and it created a bunch of output files:

login02:~/Ribo-seq/B721/Price/parameters/log$ ls -l
total 508
-rw-r--r-- 1 tamarao ribo_seq 12451 Mar 24 07:09 A0101.err
-rw-r--r-- 1 tamarao ribo_seq     0 Mar 22 22:34 A0101.out
-rw-r--r-- 1 tamarao ribo_seq 26471 Mar 24 07:43 A3303.err
-rw-r--r-- 1 tamarao ribo_seq     0 Mar 22 22:34 A3303.out
-rw-r--r-- 1 tamarao ribo_seq 16689 Mar 24 01:16 B1501.err
-rw-r--r-- 1 tamarao ribo_seq     0 Mar 22 22:34 B1501.out
-rw-r--r-- 1 tamarao ribo_seq 15087 Mar 24 04:08 B4402.err
-rw-r--r-- 1 tamarao ribo_seq     0 Mar 22 22:34 B4402.out
-rw-r--r-- 1 tamarao ribo_seq  1129 Mar 24 07:49 parameters.disambi.err
-rw-r--r-- 1 tamarao ribo_seq     0 Mar 24 07:48 parameters.disambi.out
-rw-r--r-- 1 tamarao ribo_seq 14684 Mar 24 07:48 parameters.merge.err
-rw-r--r-- 1 tamarao ribo_seq     0 Mar 24 07:43 parameters.merge.out
-rw-r--r-- 1 tamarao ribo_seq     0 Mar 24 08:01 parameters.price.out
-rw-r--r-- 1 tamarao ribo_seq     0 Mar 24 08:01 parameters.report.out

As you can see, PRICE only created parameters.price.out but no parameters.price.err, and the same holds for the report job. The Price folder was never created. The logs don't report any errors. As an example, here is the log for one of the samples:

login02:~/Ribo-seq/B721/Price/parameters/log$ more A0101.err
adapter AGATCGGAAGAGCACACG
---
mRpm   million reads per minute
mNpm   million nucleotides per minute
mCps   million alignment cells per second
lint   total removed reads (per 10K), sum of columns to the left
25K reads per dot, 1M reads per line  seconds  mr mRpm mNpm mCps {error qc  low  len  NNN tabu nobc cflr  cfl lint   OK} per 10K
........................................   12   1  5.2  265   86    0    0    0    0    0    0    0    0    0    0 10000
........................................   12   2  5.2  266   86    0    0    0    0    0    0    0    0    0    0 10000
........................................   12   3  5.2  265   86    0    0    0    0    0    0    0    0    0    0 10000
[... 75 nearly identical progress lines (1M-read blocks 4-78, all with 0 errors) omitted ...]
........................................   11  79  5.2  267   86    0    0    0    0    0    0    0    0    0    0 10000
................................
[reaper] check 0 errors, 0 reads truncated, 79821019 clean, 0 lint, 79821019 total
2018-03-22 23:42:45.488 INFO Command: gedi -e FastqFilter -D -ld A0101.readlengths.tsv -min 18 A0101.lane.clean
2018-03-22 23:42:47.558 INFO Discovering classes in classpath
2018-03-22 23:42:48.255 INFO Preparing simple class references
2018-03-22 23:42:48.981 INFO Gedi 1.0.2 (JAR) startup
# reads processed: 77602721
# reads with at least one reported alignment: 45746870 (58.95%)
# reads that failed to align: 31855851 (41.05%)
Reported 90471873 alignments to 1 output stream(s)
# reads processed: 31855851
# reads with at least one reported alignment: 21770463 (68.34%)
# reads that failed to align: 9954129 (31.25%)
# reads with alignments suppressed due to -m: 131259 (0.41%)
Reported 64311317 alignments to 1 output stream(s)
[bam_sort_core] merging from 8 files and 8 in-memory blocks...
# reads processed: 31855851
# reads with at least one reported alignment: 21245477 (66.69%)
# reads that failed to align: 10538502 (33.08%)
# reads with alignments suppressed due to -m: 71872 (0.23%)
Reported 99047444 alignments to 1 output stream(s)
[bam_sort_core] merging from 16 files and 8 in-memory blocks...
2018-03-24 04:47:56.116 INFO Command: gedi -e MergeSam -D -genomic human_g1k_v37 human_g1k_v37 -t /ahg/regevdata/projects/Ribo-seq/B721/Price/./parameters/scripts/A0101.prio.csv -prio /ahg/regevdata/projects/Ribo-seq/B721/Price/./parameters/scripts/A0101.prio.oml -chrM -o A0101.cit
2018-03-24 04:47:56.954 INFO Discovering classes in classpath
2018-03-24 04:47:57.470 INFO Preparing simple class references
2018-03-24 04:47:58.516 INFO Gedi 1.0.2 (JAR) startup
2018-03-24 04:48:03.022 INFO Reading oml /home/unix/tamarao/.gedi/genomic/human_g1k_v37.oml
2018-03-24 04:48:03.074 INFO Done reading oml /home/unix/tamarao/.gedi/genomic/human_g1k_v37.oml
2018-03-24 04:48:13.911 INFO Reading oml /ahg/regevdata/projects/Ribo-seq/B721/Price/./parameters/scripts/A0101.prio.oml
2018-03-24 04:48:13.936 INFO Done reading oml /ahg/regevdata/projects/Ribo-seq/B721/Price/./parameters/scripts/A0101.prio.oml
2018-03-24 07:09:01.629 INFO Command: gedi Nashorn -e println(EI.wrap(DynamicObject.parseJson(FileUtils.readAllText(new File('A0101.cit.metadata.json'))).getEntry('conditions').asArray()).mapToDouble(function(d) d.getEntry('total').asDouble()).sum())
2018-03-24 07:09:02.919 INFO Discovering classes in classpath
2018-03-24 07:09:03.894 INFO Preparing simple class references
2018-03-24 07:09:05.064 INFO Gedi 1.0.2 (JAR) startup

Only the merge.err looks suspicious:

login02:~/Ribo-seq/B721/Price/parameters/log$ more parameters.merge.err

/ Writing 1+
- Writi
\
\ Writing 2-
| Writing
/ Wr
/ Writing 5-
- Writing 6
\ Writ
|
| Writing 9-
\ Writing 11+
- Writin
/
\ Writing 14+
- Writin
/
\ Writing 17-
- Writin
/
\ Writing 21-
- Writin
Processed 0 elements in 1m 24s 356ms (Throughput: 0.0/sec)

I'm not sure why it says it processed 0 elements?

Is there a way to re-start the pipeline or do I have to start from the beginning if it crashes?

It has created a bunch of files like parameters.cit:

login02:~/Ribo-seq/B721/Price/parameters$ ls -la
total 739701
drwxr-sr-x 5 tamarao ribo_seq       227 Mar 24 08:01 ./
drwxrwsr-x 3 tamarao ribo_seq       184 Mar 24 08:01 ../
drwxr-sr-x 2 tamarao ribo_seq       449 Mar 24 08:01 log/
-rw-r--r-- 1 tamarao ribo_seq 594016716 Mar 24 08:01 parameters.cit
-rw-r--r-- 1 tamarao ribo_seq       292 Mar 24 07:48 parameters.cit.metadata.json
-rw-r--r-- 1 tamarao ribo_seq      1005 Mar 22 22:29 parameters.params.json
-rw-r--r-- 1 tamarao ribo_seq  38854063 Mar 24 07:58 parameters.rescue.csv
drwxr-sr-x 2 tamarao ribo_seq       880 Mar 24 07:43 report/
drwxr-sr-x 2 tamarao ribo_seq       501 Mar 22 22:34 scripts/

I would appreciate any advice about making this work for multiple samples. Gedi Pipeline works like a charm for one sample!

Thank you!

TamaraO commented 6 years ago

After playing around with this some more, it seems to be an issue with setting the execution mode to parallel (-r parallel). If I run it with -r serial, it completes without issues.

florianerhard commented 6 years ago

Hm... the .merge.err output indeed looks strange; these contents should not be there at all unless it is called with -p. Could you post the start.bash script here? Merging itself seems to have worked, since your parameters.cit file containing all read mappings was created.

Could it be that there were still some background processes running when you used parallel?

TamaraO commented 6 years ago

Hi Florian,

I am still confused about predicting ORFs for each sample independently versus for the merged data. If I use the -percond flag, separate predictions are created for each sample. How do I run it in "merged" mode? Your wiki says that -percond is the default; how do I turn it off?

Thank you!

florianerhard commented 6 years ago

Dear TamaraO,

-percond only affects the codon model. Using it makes sense if there are enough reads in each condition, and if the protocol used to prepare the samples differed from sample to sample and has led to different cleavage patterns around ribosomes; without it, a single codon model is estimated for all samples. In either case, if you run PRICE once (with or without -percond), you get a single table with ORFs.

For the pipeline, the -percond parameter is used by default. At the moment, the only way to turn it off is to edit the start.bash script generated by gedi -e Pipeline ... (just remove -percond from the line containing gedi -e Price).
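Removing the flag can also be scripted with sed. The Price line below is a mock for demonstration only; the real line in your generated start.bash will have more parameters, but the edit is the same.

```shell
# Mock start.bash line for demonstration; the actual generated script
# differs. The point is simply to delete " -percond" in place.
printf 'gedi -e Price -reads parameters.cit -percond -prefix parameters\n' > start.bash
sed -i 's/ -percond//' start.bash   # strip the flag in place
cat start.bash                       # the Price line, now without -percond
```

After this edit, rerunning start.bash should estimate a single codon model across all samples.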

Best regards, Florian