ptrebert opened this issue 6 years ago
@karl616 Ran into a bit of uncertainty - the final step of the pipeline uses a lot of config info, for example, a regular expression to derive labels for samples. Assuming that the pipelines are intended to generalize, what is your suggested solution strategy? Should the user specify labels in the sample annotation table? What about things like the analysis run id (what we store in the database) - should those still be implemented?
I don't think we can rely on consistent file naming. I think the mark, and whether it's broad or narrow, should be part of the config file. The analysis run id was our DEEP-internal construct... Today I'd say: keep it as simple as possible and leave it out. Am I missing something?
Well, not really, I guess... but then I am going to assume that there will be a human-readable label specified in the config (sample naming in, e.g., BLUEPRINT, was quite cryptic). That way, it will not be too ugly if filenames are not well-designed. The default behavior would then be to rely on "label" + "mark" to create a unique identifier that is understood by humans.
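For concreteness, a minimal sketch of that default (the tab-separated layout and the column names `label` and `mark` in the sample annotation table are assumptions, nothing is fixed yet):

```python
import csv

def build_sample_ids(annotation_table):
    """Derive human-readable, unique sample identifiers as '<label>_<mark>'.

    Assumes a tab-separated sample annotation table with 'label' and 'mark'
    columns; both names are placeholders until the config layout is decided.
    """
    sample_ids = {}
    with open(annotation_table, newline='') as table:
        for row in csv.DictReader(table, delimiter='\t'):
            sample_id = '{}_{}'.format(row['label'], row['mark'])
            if sample_id in sample_ids:
                raise ValueError('Sample identifier is not unique: {}'.format(sample_id))
            sample_ids[sample_id] = row
    return sample_ids
```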
Yes, that's what I think as well. That way, the person running the pipeline generated the label somehow and should therefore also understand it.
All right, I'll put something together under these assumptions...
Next point popped up: in the original pipeline, information like the fragment length is extracted from the metadata files generated by the short read mapping pipeline (Heidelberg). Since the new DEEP pipelines should cover all steps starting from raw reads, this implies that such information has to be created by the pipeline itself - is there already a draft so that I can get an idea of how that information is collected during the mapping step and then passed downstream? Or should all of this just be ignored (I guess not...)?
There is a draft for the mapping if you look at the Wiebke branch, but it only follows the XML file. I am not sure whether the metadata collection was part of that; if it was, we probably skipped over it. Maybe they run CollectMetrics from Picard tools - we can pull the number out of such a file if that is the case. I don't think it should be ignored.
OK, in the XML GALv1, the last step is the execution of a custom Perl script that merges all the QC information. We can skip that (because... Perl...). The Picard CollectMultipleMetrics step is executed in the CWL draft, so we would have to collect or merge that info ourselves. Feasible, I guess, though I have no idea what the Picard QC output looks like. Seems doable.
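A minimal sketch of pulling a fragment length estimate out of a Picard insert size metrics file (assuming CollectMultipleMetrics is run with the insert-size module; the choice of column, median vs. mean insert size, and the handling of multiple data rows are open questions to check against real output):

```python
def read_fragment_length(metrics_file, column='MEDIAN_INSERT_SIZE'):
    """Extract a fragment length estimate from a Picard insert size metrics file.

    Picard metrics files contain '##'-prefixed comment lines, a tab-separated
    column header on the line after '## METRICS CLASS', and one or more data
    rows below it. This returns the value from the first data row only.
    """
    with open(metrics_file) as metrics:
        lines = [line.rstrip('\n') for line in metrics]
    for idx, line in enumerate(lines):
        if line.startswith('## METRICS CLASS'):
            header = lines[idx + 1].split('\t')
            values = lines[idx + 2].split('\t')
            record = dict(zip(header, values))
            return float(record[column])
    raise ValueError('No METRICS CLASS section found in {}'.format(metrics_file))
```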
normalize peak files (flag for BL overlap)
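Regarding that last item, a minimal sketch of the flagging step, assuming "BL" refers to a blacklist region file in BED format and that "flag" means appending a 0/1 column marking peaks that overlap any blacklist interval (file layout and the extra column are assumptions):

```python
def flag_blacklist_overlap(peak_file, blacklist_file, output_file):
    """Append a 0/1 column to a BED-like peak file marking blacklist overlap.

    Naive per-chromosome interval check; an indexed approach (e.g. via
    bedtools intersect) would scale better for large region sets.
    """
    blacklist = {}
    with open(blacklist_file) as regions:
        for line in regions:
            chrom, start, end = line.split()[:3]
            blacklist.setdefault(chrom, []).append((int(start), int(end)))

    with open(peak_file) as peaks, open(output_file, 'w') as out:
        for line in peaks:
            fields = line.rstrip('\n').split('\t')
            chrom, start, end = fields[0], int(fields[1]), int(fields[2])
            hit = any(start < bl_end and end > bl_start
                      for bl_start, bl_end in blacklist.get(chrom, []))
            out.write('\t'.join(fields + ['1' if hit else '0']) + '\n')
```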