PROPOSAL: diffs

lskatz commented 3 years ago

Hi, I am finding one aspect of ChewBBACA problematic: it adds alleles to the database in the same command that performs the analysis. This leads to several problems, including:

- Automatic errors if the database is on a read-only drive. The command fails as soon as it tries to write. This has happened when I mount the database read-only with Singularity, for example, or when there is a central read-only MLST database on our high-performance computing (HPC) cluster that everyone uses.
- Pollution of the database. I queried with some bad assemblies and now the database is ruined. The only way to backtrack is to delete and recreate the database. If there is a central MLST database on our HPC, one user's mistake can pollute the database for all users.

I would like to propose that the AlleleCall step produces something like diff or patch files. I would also like to propose an additional step that can accept a patch file to update the database. The most efficient way to accept a patch might be through git commands but that is just a suggestion.
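To illustrate the proposal, here is a minimal Python sketch of what a patch for one locus could look like. It is not part of chewBBACA: the function name `allele_patch`, the file name `locus1.fasta`, the allele identifiers, and the sequences are all hypothetical. It uses the standard library's `difflib.unified_diff` to emit the novel alleles as a unified diff instead of writing them into the schema.

```python
import difflib


def allele_patch(locus_file, current_fasta, novel_alleles):
    """Return a unified diff that appends novel alleles to one locus FASTA.

    locus_file    -- hypothetical name of the locus file inside the schema
    current_fasta -- current contents of that file, as a string
    novel_alleles -- list of (allele_id, sequence) tuples to add
    """
    old_lines = current_fasta.splitlines(keepends=True)
    new_lines = list(old_lines)
    # Make sure the last existing record ends with a newline before appending.
    if new_lines and not new_lines[-1].endswith("\n"):
        new_lines[-1] += "\n"
    for allele_id, sequence in novel_alleles:
        new_lines.append(f">{allele_id}\n")
        new_lines.append(f"{sequence}\n")
    diff = difflib.unified_diff(
        old_lines,
        new_lines,
        fromfile=f"a/{locus_file}",
        tofile=f"b/{locus_file}",
    )
    return "".join(diff)


# Made-up example data: two known alleles and one novel allele.
current = ">locus1_1\nATGAAATAA\n>locus1_2\nATGAAGTAA\n"
patch = allele_patch("locus1.fasta", current, [("locus1_3", "ATGAAACAGTAA")])
print(patch)
```

Because the schema itself is never touched, a sketch like this would work against a read-only database. The resulting text could then be reviewed and applied in a separate step, for instance with `git apply` or `patch`, assuming the schema files are laid out so that the paths in the diff match.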

Having patch files might also be helpful for compatibility with any current or future MLST callers like STing, if they decide to accept patches. It would also help in communicating between labs using ChewBBACA. For example, if I discover a new allele, it would be a standardized approach to communicating it to chewbbaca.online.

Thank you for your consideration on this topic.

lskatz commented 3 years ago

The standard patch format: https://www.oreilly.com/library/view/git-pocket-guide/9781449327507/ch11.html

ramirma commented 3 years ago

Thanks for the suggestions, @lskatz. Some of the points you raised have been under discussion in the group for some time, so your comments are an excellent starting point to think more seriously about this. I see @rfm-targa has already self-assigned this. I would just like to highlight that communication with the chewie name server at chewbbaca.online is already automated in chewBBACA, including the submission of new alleles identified for the first time locally. You can read more about this at https://chewie-ns.readthedocs.io/en/latest/user/synchronize_api.html.

lskatz commented 3 years ago

Thank you @ramirma and @rfm-targa for having already thought about this! Thank you for considering this topic!

lskatz commented 1 year ago

Hi, has all this been fixed in version 3?

rfm-targa commented 1 year ago

Hello @lskatz! We've added the --no-inferred parameter to let users decide whether novel alleles are added to the schema. If you use that parameter, chewBBACA will still classify novel alleles but will not add them to the schema (intermediate files are created in a separate directory). This should help prevent database pollution.

Since --no-inferred stops novel alleles from being added to the schema, it should also be possible to perform allele calling when the schema is read-only. The exception is the first time a schema is used for allele calling with chewBBACA v3, whether the schema was created with chewBBACA v3 or with chewBBACA <= 2.8.5: on that first run, chewBBACA v3 creates files with pre-computed values that are used to speed up execution, so the schema must be writable. After the first AlleleCall execution, you can use or copy the schema and run it in read-only mode with the --no-inferred parameter. The pre-computed files are only updated when novel alleles are added to the schema. Let us know if you run the latest version and any of these issues are not fixed. We'll gladly make changes so that it works under both scenarios you've described.
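As a rough sketch of the read-only workflow described above, the following Python snippet assembles an AlleleCall command line with --no-inferred. The helper name `build_allelecall_command` and all paths are hypothetical. The `-i`, `-g`, `-o`, and `--cpu` options are written from memory of the chewBBACA documentation rather than taken from this thread, so check them against `chewBBACA.py AlleleCall --help` for your installed version.

```python
import shlex


def build_allelecall_command(input_dir, schema_dir, output_dir, cpu=1):
    """Build an AlleleCall command line that should leave the schema unchanged."""
    return [
        "chewBBACA.py", "AlleleCall",
        "-i", input_dir,    # assumed: directory with the input assemblies
        "-g", schema_dir,   # assumed: schema directory (may be read-only)
        "-o", output_dir,   # assumed: writable output directory
        "--cpu", str(cpu),  # assumed: number of CPU cores to use
        "--no-inferred",    # from this thread: do not add novel alleles
    ]


# Hypothetical paths, for illustration only.
command = build_allelecall_command(
    "/data/assemblies", "/shared/schema", "/scratch/results", cpu=4
)
print(shlex.join(command))
# To actually run it: subprocess.run(command, check=True)
```

Per the explanation above, the schema should already have gone through one AlleleCall run with write access, so that the pre-computed files exist before it is mounted read-only.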

ramirma commented 1 year ago

@lskatz, I hope @rfm-targa's answer clarifies the points you raised. Please also note that chewBBACA can now run in three different modes, which may be of use to you. For more information, please have a look at the documentation. Do let us know whether the solutions implemented fully address the issues you raised.

B-UMMI / chewBBACA

PROPOSAL: diffs #103