Hi nickp60,
nice script, well done! However, I am kind of reluctant to merge this in as is...
I will try to explain my reasons below and hopefully we can find a solution that increases global net happiness...
Universality: BlobTools was meant to be a "somewhat" universal solution. That means that people can use their favourite assembler, mapper, and taxonomy-inference software in whatever workflow they want. I mention certain tools in the readme.io, but ultimately the choice of assembly/mapping/taxonomy tools depends on the dataset(s), genome(s), user(s), etc. For example, you are probably working with mixes of small genomes and are happy with metaSPAdes, but for somebody who works on "big" genomes (>1 Gb) your workflow would take weeks/months to complete (depending on hardware, obviously). Similarly, your script deals with a single paired-end read library; often, people have combinations of several read libraries with different insert sizes that have to be passed to the assembler separately. The point is that I have seen people use BlobTools in very different ways with very different input files, and I think it would be very difficult to make one script that makes a lot of people happy. This was one of the main reasons to make BlobTools a modular toolkit: so that people can build their own workflows depending on their needs.
Technical aspects: Many people run their workflows on SGE clusters (or some other type of batch queuing system). Your script requires substantial CPU/RAM (for assembly/mapping/blasting) but would bypass the scheduler entirely, which would probably upset some sysadmins/users. You also write SAM files (never write SAM files; see the sketch below), but then only use SPAdes k-mer coverage for BlobTools, which is a slightly different thing from the mean base-coverage extracted from BAM files.
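For what it's worth, the SAM issue is easy to avoid by piping the mapper straight into samtools. A rough sketch; the mapper choice and file names are just placeholders:

```bash
# never let a SAM file touch disk: stream the mapper output into a sorted BAM
# (bwa and the file names here are placeholders -- use whatever mapper you like)
bwa mem -t 8 assembly.fasta reads_1.fq reads_2.fq \
    | samtools sort -@ 8 -o mapping.sorted.bam -
samtools index mapping.sorted.bam
```

BlobTools can then derive mean base-coverage from the sorted BAM (via `blobtools create -b mapping.sorted.bam`) instead of relying on the assembler's k-mer coverage.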
Maintenance: if a script like yours gets put on the BlobTools repo, feature requests will follow eventually. And, as stated in the point above, turning this into something that makes most people happy will require some work...
Reduction in 'head-scratching': while your motivation is noble, I think a certain amount of head-scratching is necessary in bioinformatics. It is very easy to make mistakes when assembling genomes, and offering a nice workflow without properly describing the use case (and separating it from other use cases) might do more harm than good.
What I would suggest is that you put the script in its own repo, with documentation of its intended use case.
And I would be very happy to reference it on the BlobTools GitHub/Readme.io as an example of "Workflow B" (as in Laetsch & Blaxter, 2017). That way it would be clear that your workflow is for people who do similar things to what you do, and you would take care of future maintenance obligations.
Does that make sense? I don't want to come across as 'gate-keeping' the BlobTools repo (I actually want people to contribute more). It is just that I have thought about the whole pre-made workflow business in the past and decided that it would cause more harm than good.
But I am happy to discuss this in more detail if needed.
cheers,
dom
Hi dom,
Thanks for your reply! I am well aware that it is a very limited/naive use case, and I have also been bitten by the side effects of too little head-scratching. I think the idea of a separate repo that could perhaps be mentioned in the docs would be a good way forward.
When I was first trying to get up and running with BlobTools, my biggest hurdle was that the documentation seemed to start partway through the process -- the user is expected to already have a hits file. While it is great that the input is flexible enough to handle a hits file from many sources, it is a non-standard format, and it took me a while to go through the tutorial docs and the hits-file docs to figure out just how to generate what I needed. Perhaps it would be better if the workflow in this script were written up as a tutorial?
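For reference, the command I eventually pieced together to produce that hits file looked something like this; the cutoffs and thread count are just what worked for my data, not a recommendation:

```bash
# produce the three-column (qseqid, staxids, bitscore) hits file that
# `blobtools create -t` expects; the cutoffs/threads here are my own choices
blastn -task megablast -query contigs.fasta -db nt \
    -outfmt '6 qseqid staxids bitscore std' \
    -max_target_seqs 1 -max_hsps 1 -evalue 1e-25 \
    -num_threads 8 -out contigs.vs.nt.megablast.out
```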
Now that I am more familiar with the whole workflow, I find myself deviating little from one routine: comparing blobs from the reads that map to my reference against blobs from the reads that do not -- hence this script. It is a very limited use case, but perhaps with some tweaking we could find a common ground between uber-specific and 100% modular.
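The partitioning step itself boils down to something like the following sketch, assuming a name-sorted BAM of the reads against the reference (file names are made up):

```bash
# name-sort so samtools fastq keeps mates together
samtools sort -n vs_ref.bam -o vs_ref.nsorted.bam
# pairs where both mates map (exclude flags 4 = read unmapped, 8 = mate unmapped)
samtools fastq -F 12 -1 map_1.fq -2 map_2.fq -s /dev/null vs_ref.nsorted.bam
# pairs where both mates fail to map
samtools fastq -f 12 -1 unmap_1.fq -2 unmap_2.fq -s /dev/null vs_ref.nsorted.bam
```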
What are the most common use cases that you hear about? I imagine it is probably all over the place, but are there any trends you have noticed really picking up?
Thanks again for your reply!
~Nick
Hi all, I want to suggest that a runner script could be added to blobtools. None of the inputs (metagenomic assembly, custom blast results, mapping file) is necessarily something that users would have already prepared, and a script such as this one could help users go from reads and a reference to plots with minimal head-scratching.
This script, blob_check.sh, is meant to fill that gap (and it's what I use when I need to check for contamination with blobtools). It first partitions the reads into three sets: the full set, those aligning to a reference, and those failing to align. For each of those read sets, it performs an assembly with metaSPAdes, blasts the results, and generates a blobplot. It takes 5 args. The script checks that the user has provided all the args, that the required tools are available on the PATH, and that the BLASTDB variable has been set. A more flexible and robust script could be written in Python or something; this is just my minimal implementation.
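For the curious, here is a skeleton of what the script does. This is an outline rather than the script itself; the argument order, file names, and blobtools flags are as I remember using them and should be treated as hypothetical:

```bash
#!/bin/bash
# Outline of blob_check.sh's logic; argument order and file names are hypothetical.
set -euo pipefail

if [ "$#" -ne 5 ]; then
    echo "Usage: $0 <reference.fasta> <reads_1.fq> <reads_2.fq> <outdir> <threads>" >&2
    exit 1
fi
REF=$1; R1=$2; R2=$3; OUT=$4; THREADS=$5

# fail early if any required tool is missing from the PATH
for tool in bwa samtools spades.py blastn blobtools; do
    command -v "$tool" >/dev/null || { echo "ERROR: $tool not found in PATH" >&2; exit 1; }
done
# fail early if BLASTDB is unset
: "${BLASTDB:?ERROR: the BLASTDB environment variable must be set}"

mkdir -p "$OUT"
# 1) map the reads to $REF and split the pairs into all/map/unmap sets
#    (bwa + samtools, producing ${subset}_1.fq / ${subset}_2.fq for each set)
# 2) assemble, blast, and blobplot each read set
for subset in all map unmap; do
    spades.py --meta -t "$THREADS" -1 "$OUT/${subset}_1.fq" -2 "$OUT/${subset}_2.fq" \
        -o "$OUT/${subset}_asm"
    blastn -task megablast -query "$OUT/${subset}_asm/contigs.fasta" -db nt \
        -outfmt '6 qseqid staxids bitscore std' -num_threads "$THREADS" \
        -out "$OUT/${subset}.vs.nt.out"
    # -y spades parses k-mer coverage from the SPAdes contig headers
    blobtools create -i "$OUT/${subset}_asm/contigs.fasta" -t "$OUT/${subset}.vs.nt.out" \
        -y spades -o "$OUT/${subset}"
    blobtools plot -i "$OUT/${subset}.blobDB.json" -o "$OUT/${subset}"
done
```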
Thanks for your time, and for such a great tool!