annotation using list of gene IDs

fmalmeida commented 2 years ago

Add a new module which interprets a configuration provided by the user in order to annotate the input, using a list of desired genes from a reference. Ideally it would:

- [x] Download the sequences selected by the user
- [x] Format these sequences so they can be used by blastp and the reports are generated properly
- [x] Use this FASTA subset to annotate the sample and create various reports, such as:
  - [x] A table of the annotation: genes found, % identity, % coverage, alignment length, coordinates, etc.
  - [ ] Does this annotation intersect with any gene detected by Prokka? Do the two annotations differ? Create a table comparing them.
  - [ ] Generate the final HTML report for this custom annotation
- [x] Integrate this option with the already implemented custom database analysis that uses the user's pre-formatted FASTAs (--custom_db), so the module either downloads sequences from NCBI and formats them for custom annotation or uses the user's pre-formatted database (in FASTA).
  - [x] Does it automatically detect between prot and nucl?
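
As a rough illustration of the annotation table item above, here is a minimal Python sketch that turns BLAST tabular output into rows with % identity, % coverage, alignment length and coordinates. It is not the pipeline's actual code. It assumes BLAST was run with `-outfmt "6 qseqid sseqid pident length qstart qend sstart send slen"`, that the sample is the query and the reference gene is the subject, and that coverage means the fraction of the reference gene covered by the alignment. The function name `summarize_hits` and the output keys are hypothetical.

```python
import csv
import io

# Assumed column order, matching:
#   -outfmt "6 qseqid sseqid pident length qstart qend sstart send slen"
COLUMNS = ["qseqid", "sseqid", "pident", "length",
           "qstart", "qend", "sstart", "send", "slen"]


def summarize_hits(blast_tsv, min_identity=0.0, min_coverage=0.0):
    """Return one dict per BLAST hit that passes the identity/coverage cutoffs."""
    rows = []
    reader = csv.DictReader(io.StringIO(blast_tsv), fieldnames=COLUMNS, delimiter="\t")
    for rec in reader:
        sstart, send, slen = int(rec["sstart"]), int(rec["send"]), int(rec["slen"])
        # Subject coordinates may be reversed for minus-strand hits, hence abs().
        aligned_on_gene = abs(send - sstart) + 1
        coverage = 100.0 * aligned_on_gene / slen
        identity = float(rec["pident"])
        if identity < min_identity or coverage < min_coverage:
            continue
        rows.append({
            "gene": rec["sseqid"],
            "contig": rec["qseqid"],
            "identity": identity,
            "coverage": coverage,
            "aln_length": int(rec["length"]),
            "start": int(rec["qstart"]),
            "end": int(rec["qend"]),
        })
    return rows


# Made-up example data: two hits, only the first covers the whole gene.
EXAMPLE = (
    "contig_1\tgeneA\t98.5\t200\t1001\t1200\t1\t200\t200\n"
    "contig_2\tgeneB\t80.0\t50\t10\t59\t101\t150\t200\n"
)

for row in summarize_hits(EXAMPLE, min_coverage=50):
    print(row)
```

The cutoffs are left as parameters because the issue does not state which identity or coverage thresholds the module should apply.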

Anything else?

fmalmeida commented 2 years ago

I have already created a script that downloads the GenBank records of genes from the NCBI Protein database, given a list of IDs, and then converts these GenBank files into a well-formatted FASTA database to be used by the pipeline as a custom protein database.
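
The formatting half of that script could look roughly like the sketch below. It is only an illustration: the download step is left out because it needs network access, the records are assumed to have already been parsed into dictionaries, and the header layout (`>DB|accession|gene product`), the function name `to_fasta` and the dictionary keys are all hypothetical, not the format the pipeline actually uses.

```python
def to_fasta(records, db_name="NCBI_Proteins", width=60):
    """Render parsed protein records as FASTA text.

    Each record is a dict with the (hypothetical) keys
    'accession', 'gene', 'product' and 'sequence'.
    """
    lines = []
    for rec in records:
        # Hypothetical header layout: database, accession, then gene and product.
        lines.append(f">{db_name}|{rec['accession']}|{rec['gene']} {rec['product']}")
        seq = rec["sequence"].upper()
        # Wrap the sequence at a fixed width, as most FASTA files do.
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Made-up record, for illustration only.
example_records = [
    {
        "accession": "ACC_0001",
        "gene": "geneA",
        "product": "hypothetical protein",
        "sequence": "mkv" + "A" * 67,  # 70 residues, so it wraps onto two lines
    }
]

print(to_fasta(example_records), end="")
```

Keeping the database name, accession, gene and product in the header means a later report can recover them from the BLAST subject ID alone.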

fmalmeida commented 2 years ago

This issue will be handled in branch issue-31.

fmalmeida commented 2 years ago

To decide:

Should these custom annotations, whether from the already implemented --custom_db or from the current implementation using NCBI Protein accessions (--ncbi_proteins), be included in the final GFF?
- If yes, how should they be added? Additional_database={filename,NCBI_Proteins};{DB_NAME}_product=.....;{DB_NAME}_description=.....?
Or should they only be given as an additional result of the main pipeline, instead of acting as a main tool whose output goes into the main result?
- If yes, the final report of this custom annotation would need:
  - A table containing its intersection with the main annotation. This table should show how the gene was annotated by Prokka and how it was annotated using the custom database (including the alignment details).

I am leaning towards option 2, to keep things more standardized and easier to maintain and track.
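
For the intersection table mentioned under option 2, the core operation is an interval overlap between custom database hits and Prokka genes on the same contig. The Python sketch below shows one way to do it, assuming 1-based inclusive coordinates with start <= end. The `Feature` type and the function names are hypothetical, and a real implementation would read these features from the BLAST results and the Prokka GFF.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    contig: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    name: str   # gene name (custom hit) or product (Prokka)


def overlap_length(a, b):
    """Number of shared bases between two features, 0 if none or different contigs."""
    if a.contig != b.contig:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def intersection_table(custom_hits, prokka_genes):
    """List (custom_name, prokka_name, overlap_bp); prokka_name is None if no overlap."""
    table = []
    for hit in custom_hits:
        overlaps = [(gene, overlap_length(hit, gene)) for gene in prokka_genes]
        overlaps = [(gene, n) for gene, n in overlaps if n > 0]
        if not overlaps:
            table.append((hit.name, None, 0))
        for gene, n in overlaps:
            table.append((hit.name, gene.name, n))
    return table


# Made-up coordinates, for illustration only.
custom = [Feature("contig_1", 100, 200, "geneA"), Feature("contig_2", 1, 50, "geneB")]
prokka = [Feature("contig_1", 150, 300, "hypothetical protein")]

for row in intersection_table(custom, prokka):
    print(row)
```

Hits with no overlapping Prokka gene are kept with `None`, so the table also shows which custom annotations the main annotation missed.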

However, one thing has been observed: the parameter --ncbi_proteins loads a protein FASTA, while --custom_db expects a nucl FASTA. This may cause confusion. To make things cleaner, it would be best if --custom_db accepted either a prot or a nucl FASTA, automatically detecting the input type and selecting between BLASTn and BLASTp accordingly.
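
One simple way to do this detection is to look at the fraction of nucleotide letters in the sequences. The Python sketch below illustrates that heuristic only. The 0.9 threshold, the accepted alphabet and the function names are assumptions, not what the pipeline does, and very short or highly ambiguous sequences could be misclassified.

```python
NUCLEOTIDE_CHARS = set("ACGTUN")


def guess_sequence_type(fasta_text, threshold=0.9):
    """Return 'nucl' or 'prot' based on the fraction of nucleotide letters."""
    residues = "".join(
        line.strip()
        for line in fasta_text.splitlines()
        if line.strip() and not line.startswith(">")  # skip headers and blank lines
    ).upper()
    if not residues:
        raise ValueError("no sequence data found in FASTA input")
    nucleotide_like = sum(ch in NUCLEOTIDE_CHARS for ch in residues)
    # Assumed cutoff: mostly ACGTUN means nucleotide, anything else is protein.
    return "nucl" if nucleotide_like / len(residues) >= threshold else "prot"


def choose_blast_program(fasta_text):
    """Pick the BLAST program named in the issue for each database type."""
    return "blastn" if guess_sequence_type(fasta_text) == "nucl" else "blastp"


print(choose_blast_program(">gene1\nATGCGTACGTTAGC\n"))
print(choose_blast_program(">prot1\nMKVLAAGIVGLLLAQ\n"))
```

A ratio is used instead of a strict alphabet check because A, C, G and T are also valid amino acid codes, so a protein can contain many of them without being a nucleotide sequence.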

The main tasks that must be completed before this issue is done will always be kept in the first comment, which will be updated as needed.

fmalmeida commented 2 years ago

Almost ready!

Now I must work on the report file and the generation of the intersection table.

fmalmeida commented 2 years ago

While working on a good way to bring the modules together into a single module that automatically understands its inputs, we saw that it would require small changes to how the databases are formatted and downloaded, which would in turn require changes to the Docker images.

Therefore, instead of continuing with this in its own branch and trying to make it available as soon as possible, this feature will be implemented together with issue #36 in branch remodeling. This allows everything to be reworked in one go, and avoids creating a new Docker image that would then change drastically between two releases.

Thus, the branch issue-31 is now kept only as a backup from which to copy the code that has already been developed for this issue; its development will continue in #44.

fmalmeida / bacannot

annotation using list of gene IDs #31