bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

issues reading symlinks when using custom vcfanno config #2514

Closed matthdsm closed 5 years ago

matthdsm commented 6 years ago

Hi Brad,

We use a custom vcfanno config in our production environment. Up until now, we've used full paths to define the source files. With this update, I'd like to use shortened paths like the other configs use, e.g. variation/dbnsfp.txt.gz. To do this, I've placed a symlink in the variation directory pointing at the actual file. When running vcfanno, it complains that the file doesn't exist. Are there any file checks in place that have trouble with symlinks? When I use the full path, everything runs OK.
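For reference, vcfanno configs are TOML; a minimal sketch of the setup described above might look like this (the path, column, and field names are illustrative, not our actual config):

```toml
# Hypothetical annotation block. "variation/dbnsfp.txt.gz" is resolved
# relative to the reference directory and is a symlink pointing at the
# real tabix-indexed file elsewhere on the filesystem.
[[annotation]]
file = "variation/dbnsfp.txt.gz"
columns = [5]
names = ["dbnsfp_score"]
ops = ["first"]
```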

Thanks M

chapmanb commented 6 years ago

Matthias; Sorry about the problem and thanks for the report. Can you please post the error message you're seeing? We're not trying to restrict symlinking but it's not totally clear where and what is failing so I don't know where to begin debugging.

Practically, we're going to have to start thinking about restricting arbitrary vcfanno input files here, as these won't work with any type of CWL run going forward. CWL requires all input files to be categorized and pre-specified, so a config full of external links like this won't work. I've re-worked the GEMINI and somatic vcfanno inputs to take this into account and am going to have to think about better ways to handle this going forward. So having a more complete picture of what you're doing, and how we could make it more formalized and supported in bcbio for CWL support, would be generally helpful.

Thanks for the help debugging.

matthdsm commented 6 years ago

Hi Brad,

Problem is, there are no error messages, it seems the file just gets skipped over, so I can't provide much info here. My best guess is that at some point a file check is done that doesn't handle symlinks properly, and thus skips over the entire config.
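One plausible mechanism, sketched here as an assumption rather than a confirmed diagnosis: Python's `os.path.exists()` follows symlinks, so a dangling link (or one created relative to the wrong directory) looks like a missing file to an existence check, and the annotation source could be dropped without any error message. The file names below are purely illustrative:

```python
import os
import tempfile

tmp = tempfile.mkdtemp()
target = os.path.join(tmp, "dbnsfp.txt.gz")     # the real data file (hypothetical name)
link = os.path.join(tmp, "dbnsfp_link.txt.gz")  # symlink a config might reference

os.symlink(target, link)       # target doesn't exist yet -> dangling link
print(os.path.lexists(link))   # True: the link itself is present
print(os.path.exists(link))    # False: exists() follows the link to the missing target

open(target, "w").close()      # now create the real file
print(os.path.exists(link))    # True: the link resolves
print(os.path.realpath(link))  # canonical path of the real file
```

If bcbio validates config entries with an `exists()`-style check against a path that hasn't been resolved yet (or against a relative symlink evaluated from a different working directory), this would match the silent-skip behavior.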

In the case of the CWL implementation, I totally get that all files should be staged beforehand. However, it would be a huge loss, in my opinion, to be unable to add custom configs to vcfanno. It's a very powerful and highly customizable tool, which makes it worth having. But without the custom annotations, it loses a lot of its appeal.

In our case, for example, we add a config file to annotate with an in-house allele frequency (AF) data source. If we lose this, we'd have to reannotate and recreate the GEMINI db as a postprocessing step, which would create a lot of overhead to retain functionality we already had.

Practically, would it be possible to provide guidelines on how to add custom configs that are CWL compatible? I can't imagine many issues with having to put your reference data and config in specific directories in order to have it work. In practice, this is what we've already been doing.

Cheers M

chapmanb commented 6 years ago

Matthias; Thanks for this description. You have a pretty custom setup here so we'll definitely have to do some thinking about how to support this in CWL going forward. Some things like VEP are going to be a challenge in CWL and I'm not entirely sure how best to tackle them. VEP has a large directory full of files and I'm worried about the slowdowns with tarring this up like we can do with the more compact snpEff inputs. Some of the other files you use are also pretty large which makes staging a challenge as you start to move to non-shared filesystem inputs.

From my side it would be helpful if you could isolate which of the custom additions we have in bcbio that we really need. There has been a lot of customization and the more we can simplify the easier it will be to port and maintain going forward. For your custom data sources, are these something sharable we could automate and include in bcbio? As part of thinking about what data is useful and helpful, enumerating and including these could benefit others as well and make this more maintainable/less custom for you.

In general, I just want to have a smooth transition to CWL, try and simplify where possible, and help maintain the parts of your workflow that you need so am trying to understand how best to support this. Thanks again for the helpful discussion.

matthdsm commented 6 years ago

Hi Brad,

I understand the difficulties. I do think passing up on VEP would create a serious backlash from the community. The annotations from Ensembl are vital to us as a standardized resource.

What do you see as custom additions? We do use quite a lot of config params for extra flags, but those were requested specifically to make the output more useful downstream.

Sharing our custom data is impossible at this time. It's a collection of clinical data we use to detect trends in our local population (read: patients at our hospital), and that data can't be shared in any way without proper informed consent. Losing this as an annotation would be a pity, but we could use the resource in another way if you decide to drop the feature.

Perhaps it would be useful if you opened up a couple of issues yourself, explaining what problems you encounter transferring to CWL. This way the community could get a clear picture of where you want to go and add their two cents where possible. I'm sure there's a lot of untapped expertise and ideas lying around, waiting to be used.

Cheers M

chapmanb commented 6 years ago

Matthias; Thanks much for this, it's helpful to understand your use cases and try to support them going forward with CWL. In general, you use a lot of custom annotations (VEP additions, dbNSFP, gnomAD genome files) which we don't have a ton of experience with, as we don't use them regularly. Many of these are pretty large in terms of download and preparation times, so I was hoping maybe you're not finding utility in all of them and we could pare things down to a smaller supported set. Have you found everything you annotate with useful in downstream interpretation, or is there any room to investigate removing some of these? I'm just trying to figure out a minimal set of resources that would help you effectively report and filter variants. Thanks again.

matthdsm commented 6 years ago

Hi Brad,

We use nearly all of the available annotation sources. In practice (though I think our case is quite narrow), we rely heavily on VEP + plugins and some prediction fields from dbNSFP and dbscSNV. Everything else goes unused in our production pipeline. We've chosen VEP as our main annotation source, since all of its sources are well documented and 100% traceable. An added plus is that the cache is updated frequently, which lets us stay up to date without having to put in much work ourselves.

In an ideal world, I think the best option would be to use a base set of annotations and let users add extras through config files. Perhaps it's possible to add a manual for users on how to create a Docker volume (or something like that) tailored to their specific use case. It would be nice to be able to provide all this from a central point, but with all the special cases around, I think that's going to be impossible.

Downsizing the available annotations would imply breaking changes. If you choose to go that way, I would start by cleaning up older and duplicate datasets (e.g. 1KG, ExAC, ClinVar, ...). Most of those are included in the VEP/snpEff caches AND in dbNSFP; no need to keep those sets around three times. And again, when the pregenerated caches are updated, so are the sources, without any effort on our side.

Personally, I'm all for letting everyone do what they're best at and using what's provided. That would mean letting organizations with huge knowledge bases and funding create and host the annotations, while bcbio concentrates on making the greatest pipeline even greater.

Of course, that's just my 2 cents. /rant

Cheers M