This repository contains a reciprocal BLAST program for filtering down BLAST results to best bidirectional hits. It also contains a toolkit for finding and visualizing BLAST hits for gene clusters within multiple bacterial genomes.
Miscellaneous feature requests and enhancements:
[x] Have BackBLAST.py remove tempQuery.faa when finished
[ ] Get rid of the extra Rplots.pdf file generated when generate_BackBLAST_heatmap.R is run
[x] Add support for midpoint rooting in generate_BackBLAST_heatmap.R
[ ] Add quotes around variables in BackBLAST.sh to improve support for whitespace
[ ] Optionally specify gene/genome metadata files in 'auto' mode of BackBLAST.sh
[ ] Optionally specify plot dimensions in 'auto' mode of BackBLAST.sh
[ ] Remove some flags in 'setup' mode of BackBLAST.sh for clarity
[ ] Change the GToTree rule's hard link to a soft link for consistency with other Snakemake rules
[x] Confirm all Python scripts are Python 3 compatible
[ ] Consider adding a flag to skip reciprocal BLASTP and just run BLASTP if the user desires
[ ] Change flags to feel a bit more like running BLAST+ to help the user
[ ] Add support for gzipped FAA input files
[ ] Add symlink support for subject genomes (generate_run_templates.sh find command) and raise a meaningful error if no subjects are found
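
As a starting point for the gzipped FAA item above, here is a minimal Python sketch of one way input files could be opened transparently. The helper names `open_faa` and `count_sequences` are hypothetical and not part of BackBLAST; the sketch assumes compression is detected from the gzip magic bytes rather than from the file extension.

```python
import gzip
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def open_faa(path):
    """Open a FASTA file in text mode, whether or not it is gzipped.

    Hypothetical helper: detection relies on the gzip magic bytes,
    so a misnamed file (e.g. '.faa' that is really gzipped) still works.
    """
    path = Path(path)
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == GZIP_MAGIC:
        return gzip.open(path, "rt")
    return open(path, "rt")


def count_sequences(path):
    """Count FASTA records by counting header lines starting with '>'."""
    with open_faa(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

Checking the magic bytes instead of a `.gz` suffix is a design choice for this sketch; an extension-based check would be simpler but would miss renamed files.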
Edge cases for generate_BackBLAST_heatmap.R:
[ ] Check that a genome is not eliminated from the heatmap when genes are removed because they are missing from the gene metadata file. To be defensive, check one last time immediately before plotting that the tree tips correspond to the heatmap y-axis labels.
[ ] Check if there are multiple queries with the same ID before collapsing into a wide table. Warn the user and randomly pick one, or error out.
[ ] Check if two tree tips have the same name before midpoint rooting, and error out or warn and give unique numerical suffixes
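
The duplicate-name checks above (repeated query IDs, repeated tree tip names) come down to the same logic: find the repeated names, then either stop or rename them. The sketch below illustrates that logic in Python; it is not the R implementation in generate_BackBLAST_heatmap.R, and the function names `find_duplicates` and `make_unique` as well as the `_1`, `_2` suffix format are assumptions made for illustration.

```python
from collections import Counter


def find_duplicates(names):
    """Return the names that occur more than once, in first-seen order."""
    counts = Counter(names)
    seen = set()
    duplicates = []
    for name in names:
        if counts[name] > 1 and name not in seen:
            seen.add(name)
            duplicates.append(name)
    return duplicates


def make_unique(names):
    """Give repeated names numerical suffixes; leave unique names untouched.

    Suffixes that would collide with an existing name are skipped.
    """
    counts = Counter(names)
    next_suffix = Counter()
    used = set(names)
    result = []
    for name in names:
        if counts[name] == 1:
            result.append(name)
            continue
        next_suffix[name] += 1
        candidate = f"{name}_{next_suffix[name]}"
        while candidate in used:
            next_suffix[name] += 1
            candidate = f"{name}_{next_suffix[name]}"
        used.add(candidate)
        result.append(candidate)
    return result
```

A caller could warn when `find_duplicates` returns a non-empty list and then either raise an error or continue with the output of `make_unique`.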
More labour-intensive additions that could be helpful:
[ ] Have BackBLAST output FastA files with the sequences of the detected genes. Consider also optionally aligning them or even using them to make unrooted trees.
[ ] Add an option to summarize the top x fwd/rev hits of each reciprocal BLAST search for debugging purposes. This could be in a single long table format with a column for whether the BLAST is forward or reverse. Note that the reverse search would only be for the top forward search, though, so maybe it would be more accurate to put reverse BLAST searches in their own table??? Not sure. Consider also flagging when BackBLAST "just barely" misses a gene and warning the user.
[ ] Consider adding an option to just run forward BLAST only (some users might find this handy for some reason). However, make sure to warn users that this is generally not advisable.
[ ] Add the ability to have multiple query sets and query genomes.
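
To make the "single long table" idea for the top x forward/reverse hits more concrete, here is a rough Python sketch. It assumes each hit is a dict using BLAST tabular column names (`qseqid`, `sseqid`, `bitscore`) and that hits are ranked by bit score; the function name `top_hits_long` and the `direction` and `rank` columns are hypothetical, not existing BackBLAST output.

```python
def top_hits_long(forward, reverse, top_x=3):
    """Combine forward and reverse hits into one long table.

    Each output row is the original hit plus two added columns:
    'direction' ('forward' or 'reverse') and 'rank' (1 = best bit score
    for that query within that direction).
    """
    rows = []
    for direction, hits in (("forward", forward), ("reverse", reverse)):
        # Group hits by query ID, preserving first-seen query order.
        by_query = {}
        for hit in hits:
            by_query.setdefault(hit["qseqid"], []).append(hit)
        for query_hits in by_query.values():
            ranked = sorted(query_hits, key=lambda h: h["bitscore"], reverse=True)
            for rank, hit in enumerate(ranked[:top_x], start=1):
                rows.append({"direction": direction, "rank": rank, **hit})
    return rows
```

Keeping both directions in one table with a `direction` column makes filtering easy; splitting the reverse searches into their own table, as considered above, would only require returning two lists instead of one.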
Documentation:
[ ] Explain the issue of having multiple gene copies in the query reference genome
[ ] Give common commands people might run. For example, using the --until flag to just go up to the BLAST table. (This could be combined with other commands as a workaround for only being able to add one query genome per run, for example.)
[ ] Give a manual of the sub-commands of BackBLAST