Create generic pipeline init script

ptrebert commented 6 years ago

An "as simple as possible" init script usable by every DEEP pipeline. It identifies its pipeline type (WGBS, ChIP-seq etc.), loads any required reference data/configuration from the reference repository, and checks/validates the sample annotation table (entries in this table should follow the naming scheme of EGA/ENA). The sample annotation table is a TSV file (not XML, not XLS(X), not anything but tab-separated text...)

ptrebert commented 6 years ago

Some thoughts about the standardized sample annotation table that will also be read by the pipeline init script:

- As discussed, as far as possible, rules derived from the EGA/SRA metadata model should be used consistently (except for capitalization; all lowercase is easier)
- File level: filepath and filename to specify a file location; filepath plus filename gives fullpath
- Experiment level: use info from SRA 1.5 experiment
  - Example: LibraryStrategy - Bisulfite-seq, ChIP-seq, ATAC-seq, DNase-Hypersensitivity etc.
  - @karl616 could you find out how NOMe is handled? I don't see an entry for that...
  - Example: LibraryLayout - single/paired (+ nominal_length => that is the insert size in our case)
- Read level: use info from SRA 1.5 commons
  - Example: read_type - forward, reverse
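
To make the file-level convention concrete, here is a minimal sketch of reading such a table. The column names and the example row are assumptions for illustration only (the thread has not fixed a vocabulary at this point), and `read_sample_table` is a hypothetical helper, not part of any existing DEEP code.

```python
import csv
import io
import posixpath


def read_sample_table(handle):
    """Read a tab-separated sample annotation and add a 'fullpath' field per row."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        # Normalize column names to lowercase, as suggested in the thread.
        record = {key.strip().lower(): (value or "").strip() for key, value in row.items()}
        # filepath plus filename gives fullpath.
        record["fullpath"] = posixpath.join(record["filepath"], record["filename"])
        rows.append(record)
    return rows


# Hypothetical one-row table; real tables would come from a .tsv file on disk.
example = (
    "filepath\tfilename\tLibraryStrategy\tLibraryLayout\tread_type\n"
    "/data/run1\tsample_R1.fastq.gz\tChIP-seq\tpaired\tforward\n"
)
rows = read_sample_table(io.StringIO(example))
print(rows[0]["fullpath"])
```

`posixpath` is used instead of `os.path` so that the joined path looks the same on every platform.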

The sample annotation could contain an arbitrary number of fields (columns), but there has to be a required subset per pipeline type (the combination of library strategy and library layout should be sufficient to define that). Since the init script should be shipped with every pipeline and kept simple, the definition of these required fields should in turn be part of the reference repository. I imagine the startup as follows:
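
A rough sketch of the "required subset per pipeline type" idea, keyed by library strategy and library layout. The concrete column sets below are invented for illustration; in the proposal they would be loaded from the reference repository rather than hard-coded.

```python
# Hypothetical required-column sets; the real ones would live in the reference repository.
BASE = frozenset({"filepath", "filename", "read_type"})
REQUIRED_COLUMNS = {
    ("chip-seq", "single"): BASE,
    ("chip-seq", "paired"): BASE | {"nominal_length"},
    ("bisulfite-seq", "paired"): BASE | {"nominal_length"},
}


def missing_columns(strategy, layout, header):
    """Return the required columns that are absent from the table header."""
    key = (strategy.lower(), layout.lower())
    if key not in REQUIRED_COLUMNS:
        raise KeyError(f"no required-column set defined for {key}")
    present = {name.lower() for name in header}
    # Extra columns are allowed; only missing required ones are reported.
    return sorted(REQUIRED_COLUMNS[key] - present)


print(missing_columns("ChIP-seq", "paired", ["filepath", "filename", "read_type"]))
```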

0) clone pipeline repository, say, ChIP-seq
1) run pipeline init with parameters sample_table and working_directory
2) the init script shipped with each repository is dumb: it only knows its type, say, ChIP-seq, and where to find more configuration information (in the reference repository)
3) it downloads the additional configuration (if not already available locally from a previous init run), and then starts validating the sample annotation
4) last step is to check for reference files, download/prepare if necessary
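
The startup sequence could be sketched roughly as below. Everything here is an assumption for illustration: `init_pipeline` and its callable parameters are hypothetical, the download is replaced by an injected `fetch_config` function, and the config is treated as opaque text.

```python
import tempfile
from pathlib import Path


def init_pipeline(pipeline_type, sample_table, working_directory,
                  fetch_config, validate, prepare_references):
    """Fetch config once, validate the sample table, then prepare references."""
    workdir = Path(working_directory)
    workdir.mkdir(parents=True, exist_ok=True)
    config_path = workdir / f"{pipeline_type}.config"
    # Only fetch the configuration if a previous init run has not stored it already.
    if not config_path.exists():
        config_path.write_text(fetch_config(pipeline_type))
    config = config_path.read_text()
    validate(sample_table, config)
    prepare_references(config, workdir)
    return config_path


# Stand-ins for the real download/validation/reference steps.
calls = []

def fake_fetch(pipeline_type):
    calls.append(pipeline_type)
    return "required_columns=filepath,filename\n"

with tempfile.TemporaryDirectory() as tmp:
    for _ in range(2):  # the second run reuses the locally stored config
        init_pipeline("chip-seq", "samples.tsv", tmp, fake_fetch,
                      lambda table, config: None, lambda config, workdir: None)
print(len(calls))
```

Passing the fetch, validate and reference steps in as functions keeps the init script itself "dumb": it only knows its type and the order of the steps.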

Any additional thoughts?

karl616 commented 6 years ago

I'm not sure about the NOMe data, but we have submitted some to EGA already. I'll find a way to access the metadata, though. I obviously think it deserves its own category, but in the worst case we call it Bisulfite-seq and stratify by some additional field. With the different WGBS strategies (tWGBS, TruSeq, EpiGnome, PBAT etc.), this might be necessary anyhow.

I posted a request for clarification: https://github.com/enasequence/schema/issues/3

I would consider the inclusion of flowcell, lane and perhaps index as a way to track batch effects when sequencing. I usually include this as read group information in the alignment file. Adapter sequence could also be relevant. To a degree this is probably pipeline-dependent though.
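
As an illustration of carrying flowcell, lane and index into the alignment file, the sketch below builds a SAM `@RG` header line. The tags (`ID`, `SM`, `LB`, `PU`, `PL`) are standard SAM read-group tags; combining flowcell, lane and index into `ID` and `PU` is one common convention and an assumption here, as is the helper name `read_group_line`.

```python
def read_group_line(sample, library, flowcell, lane, index=None, platform="ILLUMINA"):
    """Build a SAM @RG header line from sequencing batch information."""
    unit = f"{flowcell}.{lane}"
    if index:
        unit = f"{unit}.{index}"
    fields = [
        ("ID", unit),      # must be unique per read group within one file
        ("SM", sample),
        ("LB", library),
        ("PU", unit),      # platform unit: flowcell, lane and optional index
        ("PL", platform),
    ]
    return "@RG\t" + "\t".join(f"{tag}:{value}" for tag, value in fields)


# Hypothetical sample, library and flowcell identifiers.
print(read_group_line("sample1", "lib1", "HABC", 3, "ACGT"))
```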

ptrebert commented 6 years ago

There is a field LIBRARY_CONSTRUCTION_PROTOCOL as part of the library description that sounds like a potential candidate to capture tWGBS, EpiGnome etc. The only downside is that this is a free-form text field, so it may cause some trouble. Good point about the batch effects - did you find an annotation for flowcell, lane etc. in one of the schema files?

karl616 commented 6 years ago

Yes, that could work... Perhaps we build in our own check? I cannot find flowcell or lane explicitly, but as I understand it this is captured with the run file. That keeps the files separated and that would be sufficient... Not sure how to handle naming, but there is surely a way...

ptrebert commented 6 years ago

Yeah, sure, the init script accepts a couple of fixed values for the lib construction protocol and that's it, no problem.
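
Such a check on the free-form protocol field could look roughly like this. The accepted values are taken from the WGBS strategies named earlier in the thread; the mapping and the function name are assumptions, not an agreed list.

```python
# Hypothetical whitelist: lowercase key -> canonical spelling.
ALLOWED_PROTOCOLS = {
    "twgbs": "tWGBS",
    "truseq": "TruSeq",
    "epignome": "EpiGnome",
    "pbat": "PBAT",
}


def normalize_protocol(value):
    """Map a free-text protocol entry onto one of the accepted fixed values."""
    key = value.strip().lower()
    if key not in ALLOWED_PROTOCOLS:
        accepted = ", ".join(sorted(ALLOWED_PROTOCOLS.values()))
        raise ValueError(f"unknown library construction protocol {value!r}; accepted: {accepted}")
    return ALLOWED_PROTOCOLS[key]


print(normalize_protocol(" epignome "))
```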

Do you already have an example sample annotation for one of your pipelines (WGBS or RNA) that is more or less complete that we could use as a starting point?

karl616 commented 6 years ago

Not yet... do you mean the EGA format or yml/json format? End of next week is realistic...

(Sent as an email reply to Peter Ebert's comment of Fri, Apr 27, 2018, 6:37 PM.)

ptrebert commented 6 years ago

I mean the "EGA"-style sample annotation (TSV table) plus the necessary references (also TSV? That would have the benefit that you could update it online on GitHub...) so that we have one complete set of starting values for one pipeline.

karl616 commented 6 years ago

I added a draft... As I understand it, this is what would be required for ENA submission (only sequence information so far).

This being a ChIP-seq sample, I added an extra column for antibody. This is not a general field and should probably be specific to the ChIP-seq pipeline. Perhaps there can be a simple specification of the extra information needed for each pipeline.

ENA/EGA requires an insert size for fastq file submission... hence, I added a general number. I'm still of the conviction that we should extract this programmatically from the alignment.
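
One way to extract the insert size from an alignment, sketched under simplifying assumptions: the input is plain SAM text (e.g. as produced by `samtools view -h`), and the estimate is the median absolute template length (TLEN, column 9) over properly paired first-in-pair reads. The function name and the toy records are hypothetical.

```python
import statistics


def median_insert_size(sam_lines):
    """Median |TLEN| over properly paired, primary, first-in-pair SAM records."""
    sizes = []
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag, tlen = int(fields[1]), int(fields[8])
        proper_first = (flag & 0x2) and (flag & 0x40)
        # 0x900 = secondary (0x100) or supplementary (0x800) alignment
        if proper_first and not (flag & 0x900) and tlen != 0:
            sizes.append(abs(tlen))
    return statistics.median(sizes) if sizes else None


# Toy records: two proper pairs (flags 99/147) and one improper pair (flag 65).
sam = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t99\tchr1\t100\t60\t50M\t=\t250\t200\t*\t*",
    "r1\t147\tchr1\t250\t60\t50M\t=\t100\t-200\t*\t*",
    "r2\t99\tchr1\t400\t60\t50M\t=\t650\t300\t*\t*",
    "r2\t147\tchr1\t650\t60\t50M\t=\t400\t-300\t*\t*",
    "r3\t65\tchr1\t900\t60\t50M\t=\t5900\t5000\t*\t*",
]
print(median_insert_size(sam))
```

Counting only the first read of each pair avoids using every fragment twice, and restricting to proper pairs keeps outliers such as the 5000 bp record out of the estimate.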

ptrebert commented 6 years ago

Thanks. I agree concerning the insert size/nominal length issue. The way I see it, we would have to combine the initial sample annotation with the final analysis metadata to automatically generate an EGA/ENA submission sheet. If we stick to the naming/terms extracted from the SRA schema files, this should be straightforward, though.
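
A minimal sketch of that combination step, assuming both the sample annotation and the analysis metadata are available as lists of dictionaries that share a `filename` key. The field names, values and helper names are hypothetical; values measured during analysis override the initial placeholders.

```python
import csv
import io


def build_submission_rows(samples, analysis, key="filename"):
    """Merge analysis metadata into the sample annotation, row by row."""
    by_key = {row[key]: row for row in analysis}
    merged = []
    for sample in samples:
        row = dict(sample)
        row.update(by_key.get(sample[key], {}))  # measured values win
        merged.append(row)
    return merged


def write_tsv(rows, handle):
    """Write the merged rows as a tab-separated sheet with sorted column names."""
    columns = sorted({name for row in rows for name in row})
    writer = csv.DictWriter(handle, fieldnames=columns, delimiter="\t",
                            lineterminator="\n", restval="")
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical inputs: a placeholder nominal_length replaced by a measured one.
samples = [{"filename": "s1.fastq.gz", "library_layout": "paired", "nominal_length": "200"}]
analysis = [{"filename": "s1.fastq.gz", "nominal_length": "187"}]
merged = build_submission_rows(samples, analysis)
sheet = io.StringIO()
write_tsv(merged, sheet)
print(sheet.getvalue())
```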

karl616 commented 6 years ago

I agree as well. I saw the vocabulary file. Would it be correct to map sample_label to sample_name in SRA.sample? And according to the definition of filename, it can include a relative path. Maybe that could replace filepath? The only thing I'm missing is a reference to the reference genome. We could use taxon_id, but maybe it is better to explicitly state the target assembly (mm10, hs37, hg19, hs38 etc.). Looking over the bisulfite pipeline, this set of parameters should be sufficient there as well.

deepPipelines / deepDoc

Create generic pipeline init script #1