ptrebert opened this issue 6 years ago
@karl616 Ran into a bit of uncertainty - the final step of the pipeline uses a lot of config info, for example, a regular expression to derive labels for samples. Assuming that the pipelines are intended to generalize, what is your suggested solution strategy? Should the user specify labels in the sample annotation table? What about things like the analysis run id (what we store in the database) - should those still be implemented?
I don't think we can rely on consistent file naming. I think the mark, and whether it's broad or narrow, should be part of the config file. The analysis run id was our DEEP-internal construct... Today I'd say: keep it as simple as possible and leave it out. Am I missing something?
Well, not really, I guess... but then I am going to assume that there will be a human-readable label specified in the config (sample naming in, e.g., BLUEPRINT, was quite cryptic). That way, it will not be too ugly if filenames are not well-designed. The default behavior would then be to rely on "label" + "mark" to create a unique identifier that is understood by humans.
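For concreteness, a minimal sketch of that default (the tab-separated layout and the column names `label` and `mark` in the sample annotation table are assumptions, nothing is fixed yet):

```python
import csv

def build_sample_ids(annotation_table):
    """Derive human-readable, unique sample identifiers as '<label>_<mark>'.

    Assumes a tab-separated sample annotation table with 'label' and 'mark'
    columns; both names are placeholders until the config layout is decided.
    """
    sample_ids = {}
    with open(annotation_table, newline='') as table:
        for row in csv.DictReader(table, delimiter='\t'):
            sample_id = '{}_{}'.format(row['label'], row['mark'])
            if sample_id in sample_ids:
                raise ValueError('Sample identifier is not unique: {}'.format(sample_id))
            sample_ids[sample_id] = row
    return sample_ids
```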
Yes, that's what I think as well. That way, the person running the pipeline generated the label somehow and should therefore also understand it.
All right, I'll put something together under these assumptions...
Next point popped up: in the original pipeline, information like the fragment length is extracted from the metadata files generated by the short read mapping pipeline (Heidelberg). Since the new DEEP pipelines should cover all steps starting from raw reads, this implies that such information has to be created by the pipeline itself - is there already a draft so that I can get an idea of how that information is collected during the mapping step and then passed downstream? Or should all of this just be ignored (I guess not...)?
There is a draft for the mapping if you look at the Wiebke branch, but it only follows the XML file. I am not sure whether the metadata collection was part of that; if it was, we probably skipped over it. Maybe they run CollectMetrics from Picard tools - we can pull the number out of such a file if that is the case. I don't think it should be ignored.
OK, in the XML GALv1, the last step is the execution of a custom Perl script that merges all the QC information. We can skip that (because... Perl...). The Picard CollectMultipleMetrics step is executed in the CWL draft, so we would have to collect or merge that info ourselves. Feasible, I guess, though I have no idea what the Picard QC output looks like. Seems doable.
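A minimal sketch of pulling a fragment length estimate out of a Picard insert size metrics file (assuming CollectMultipleMetrics is run with the insert-size module; the choice of column, median vs. mean insert size, and the handling of multiple data rows are open questions to check against real output):

```python
def read_fragment_length(metrics_file, column='MEDIAN_INSERT_SIZE'):
    """Extract a fragment length estimate from a Picard insert size metrics file.

    Picard metrics files contain '##'-prefixed comment lines, a tab-separated
    column header on the line after '## METRICS CLASS', and one or more data
    rows below it. This returns the value from the first data row only.
    """
    with open(metrics_file) as metrics:
        lines = [line.rstrip('\n') for line in metrics]
    for idx, line in enumerate(lines):
        if line.startswith('## METRICS CLASS'):
            header = lines[idx + 1].split('\t')
            values = lines[idx + 2].split('\t')
            record = dict(zip(header, values))
            return float(record[column])
    raise ValueError('No METRICS CLASS section found in {}'.format(metrics_file))
```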
normalize peak files (flag for BL overlap)
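Regarding that last item, a minimal sketch of the flagging step, assuming "BL" refers to a blacklist region file in BED format and that "flag" means appending a 0/1 column marking peaks that overlap any blacklist interval (file layout and the extra column are assumptions):

```python
def flag_blacklist_overlap(peak_file, blacklist_file, output_file):
    """Append a 0/1 column to a BED-like peak file marking blacklist overlap.

    Naive per-chromosome interval check; an indexed approach (e.g. via
    bedtools intersect) would scale better for large region sets.
    """
    blacklist = {}
    with open(blacklist_file) as regions:
        for line in regions:
            chrom, start, end = line.split()[:3]
            blacklist.setdefault(chrom, []).append((int(start), int(end)))

    with open(peak_file) as peaks, open(output_file, 'w') as out:
        for line in peaks:
            fields = line.rstrip('\n').split('\t')
            chrom, start, end = fields[0], int(fields[1]), int(fields[2])
            hit = any(start < bl_end and end > bl_start
                      for bl_start, bl_end in blacklist.get(chrom, []))
            out.write('\t'.join(fields + ['1' if hit else '0']) + '\n')
```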