Closed by fmalmeida 2 years ago
I have already created a script that downloads the GenBank records of genes from the NCBI Protein database given a list of IDs, and then converts these gbk files into a well-formatted FASTA database to be used by the pipeline as a custom protein database.
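As a rough illustration of the download-and-convert step described above, here is a minimal sketch. It only builds the NCBI E-utilities efetch request for a list of protein accessions and formats one FASTA record. The function names, the header layout (`>accession product`) and the line width are assumptions made for this example, not the script actually written for this issue.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities endpoint for fetching records.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accessions, rettype="gp"):
    """Build an efetch URL for NCBI Protein accessions.

    rettype="gp" requests GenPept (GenBank-style) flat files.
    """
    query = urlencode(
        {
            "db": "protein",
            "id": ",".join(accessions),
            "rettype": rettype,
            "retmode": "text",
        }
    )
    return f"{EFETCH}?{query}"


def fasta_record(accession, product, sequence, width=60):
    """Format one FASTA entry; the header layout is a hypothetical choice."""
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return f">{accession} {product}\n" + "\n".join(chunks) + "\n"
```

The URL could then be fetched with any HTTP client and the returned records parsed (for instance with Biopython) before being passed to `fasta_record`.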
This issue will be handled in branch issue-31.
To decide:
Should the annotations from --custom_db, or from the current implementation using NCBI Protein accessions (--ncbi_proteins), be included in the final GFF or not? For example as:
Additional_database={filename,NCBI_Proteins};{DB_NAME}_product=.....;{DB_NAME}_description=.....
I believe I am leaning towards option 2, to keep things more standardized and easier to maintain and track.
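To make the proposed attribute layout concrete, the sketch below assembles a GFF attribute string in the form Additional_database=...;{DB_NAME}_product=...;{DB_NAME}_description=... The helper names are hypothetical. The escaping follows the GFF3 convention of percent-encoding characters that are reserved in the attributes column; whether the pipeline escapes values this way is an assumption.

```python
# Characters that GFF3 reserves in column 9, with their percent-encodings.
_RESERVED = {
    "%": "%25",
    ";": "%3B",
    "=": "%3D",
    "&": "%26",
    ",": "%2C",
    "\t": "%09",
    "\n": "%0A",
    "\r": "%0D",
}


def gff_escape(value):
    """Percent-encode characters that would break GFF3 attribute parsing."""
    return "".join(_RESERVED.get(ch, ch) for ch in value)


def custom_db_attributes(source, db_name, product, description):
    """Build the attribute string; source is a file name or 'NCBI_Proteins'."""
    return ";".join(
        [
            f"Additional_database={gff_escape(source)}",
            f"{db_name}_product={gff_escape(product)}",
            f"{db_name}_description={gff_escape(description)}",
        ]
    )
```

For example, `custom_db_attributes("NCBI_Proteins", "mydb", "efflux pump; RND", "x=1")` yields `Additional_database=NCBI_Proteins;mydb_product=efflux pump%3B RND;mydb_description=x%3D1`.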
However, one thing has been observed: the parameter --ncbi_proteins loads a protein FASTA while --custom_db expects a nucleotide FASTA. This may cause confusion. To make things cleaner, it would be best if --custom_db accepted either a protein or a nucleotide FASTA, automatically detecting the input type and selecting between BLASTn and BLASTp accordingly.
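One possible way to detect the input type automatically is a simple composition heuristic: if nearly all residues are nucleotide letters, treat the FASTA as nucleotide, otherwise as protein. The sketch below assumes a 90% threshold and the function names shown; both are illustrative choices, not the pipeline's implementation.

```python
from typing import Iterable

# Letters counted as nucleotides (DNA, RNA and the ambiguity code N).
NUCLEOTIDE_CHARS = set("ACGTUN")


def detect_fasta_type(lines: Iterable[str], threshold: float = 0.9) -> str:
    """Return 'nucl' or 'prot' based on the fraction of nucleotide letters."""
    total = 0
    nucl = 0
    for line in lines:
        line = line.strip()
        # Skip blank lines and FASTA headers.
        if not line or line.startswith(">"):
            continue
        for ch in line.upper():
            if ch.isalpha():
                total += 1
                if ch in NUCLEOTIDE_CHARS:
                    nucl += 1
    if total == 0:
        raise ValueError("no sequence residues found")
    return "nucl" if nucl / total >= threshold else "prot"


def select_blast_program(db_type: str) -> str:
    """Map the detected database type to the BLAST program named above."""
    return {"nucl": "blastn", "prot": "blastp"}[db_type]
```

The detected type could also be passed to `makeblastdb -dbtype` when the database is built. Short or low-complexity protein sequences can fool a heuristic like this, so an explicit override option may still be worth having.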
The main tasks that should be accomplished before the issue is done will always be kept in the first comment, which will be updated as needed.
Almost ready!
Now I must work on the report file and the intersection table generation.
While working on a good way to bring the modules together into a single module that automatically understands its inputs, we saw that it would require small changes in how the databases are formatted and downloaded, which would also require changes in the Docker images.
Therefore, instead of continuing with this in its separate branch to make it available as soon as possible, this feature will be implemented together with issue #36 in branch remodeling. This allows everything to be customized in a single pass, avoiding a new Docker image that would change drastically between two releases.
Thus, the branch issue-31 is now kept only as a backup from which to copy the code already developed for this issue; its development will continue in #44.
Add a new module which interprets a configuration provided by the user in order to annotate the input, using a list of desired genes from a reference. Ideally it would (via --custom_db) either download sequences from NCBI and format them for custom annotation, or use the user's pre-formatted database (in FASTA).
Anything else?