bdaisley / isolateR

Automated processing of Sanger sequencing data, taxonomic profiling, and generation of microbial strain libraries

FASTA as input for `isoTAX`? #9

Closed lxsteiner closed 3 months ago

lxsteiner commented 4 months ago

Hi,

This looks like such a useful wrapper for handling large collections of Sanger sequences, thank you for publishing it!

I've read the tutorial and manual, but was wondering whether it would be possible to pass entries that exist only as FASTA sequences (with no .ab1 files) into isoTAX, as if they were isoQC output? Or is there a workaround for samples where only FASTA sequences exist (e.g. make up mock ab1 quality values, write them into .ab1 files, and process those in isoQC)?

The motivation is that in-house we of course have the .ab1 files from which the FASTA sequences were extracted and then used for taxonomic identification etc. But to include sequences from other labs in the same collection/pipeline (e.g. for taxonomic identity), usually only FASTA sequences are publicly available.

It would be great if this were possible entirely within isolateR. Otherwise it's a chore: process our own samples with .ab1 files here, make collections, export FASTA, add external FASTA collections, redo the taxonomic identification with some other tool, and summarize the taxonomy by hand.

Do you see a workaround at the moment, or might a similar feature be implemented in the future?

Thanks.

bdaisley commented 4 months ago

@lxsteiner - Thanks for the feedback, this is a great idea and very much doable. I will add a proper feature within the next week. As an immediate workaround, the mock ab1 quality values you mentioned would work. Just add your sequences to an existing isoQC-formatted file, fill the missing columns with mock data, and you should be able to continue on to the isoTAX > isoLIB steps as usual.

bdaisley commented 3 months ago

Hi @lxsteiner, just following up on this. I've adjusted the isoTAX function in the latest isolateR package release to allow for input of FASTA files, as requested. Brief overview as follows:

Example walkthrough

Update to latest version of isolateR

if ("package:isolateR" %in% search()) {detach("package:isolateR", unload=TRUE)}
devtools::install_github("bdaisley/isolateR")

Example case using FASTA file containing 16S rRNA genes from human gut isolates

Manual download link for FASTA example: human_gut_isolates_10.fasta

# Download the example FASTA file:
download.file("https://github.com/bdaisley/isolateR/raw/main/inst/extdata/fasta_examples/human_gut_isolates_10.fasta",
              destfile="T:/human_gut_isolates_10.fasta")

# Run isoTAX with the FASTA file as input (note: 'quick_search=FALSE' is recommended for real use)
isoTAX(input="T:/human_gut_isolates_10.fasta", quick_search=TRUE)

The above commands will generate the following output files:

Optional: Use mock isoQC table as input to isoTAX instead

The mock isoQC table can be edited by hand and then passed to isoTAX. This may be desirable if you want to add custom quality values or other metadata that are not available from a raw FASTA file.

isoTAX(input="T:/isolateR_output/01_isoQC_mock_table.csv", quick_search=TRUE)

If nothing was edited in the mock isoQC table, this last line of code produces the same output as using the FASTA file directly.
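For readers who want to see the edit step spelled out, here is a minimal sketch of the read, edit, write loop in base R. It works on a toy CSV in a temporary directory so that it runs on its own. The column names `Filename` and `note` are invented for illustration and are not the real isoQC columns, so the actual `01_isoQC_mock_table.csv` should be inspected with `names()` before anything is changed. Whether isoTAX tolerates extra columns is also an assumption here.

```r
# Toy stand-in for the mock isoQC table (hypothetical columns, not the real layout)
toy_csv <- file.path(tempdir(), "01_isoQC_mock_table.csv")
write.csv(data.frame(Filename = c("iso1", "iso2")), toy_csv, row.names = FALSE)

# Read the table and check which columns actually exist before editing
tab <- read.csv(toy_csv, stringsAsFactors = FALSE)
print(names(tab))

# Add a metadata column (hypothetical name) and write the table back in place
tab$note <- rep("external lab", nrow(tab))
write.csv(tab, toy_csv, row.names = FALSE)

# The edited table would then be passed on as in the example above:
# isoTAX(input = toy_csv, quick_search = TRUE)
```

Because `tempdir()` returns an absolute path, `toy_csv` is already in the form the `input` argument is shown with in the walkthrough.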

I hope these additions are helpful. Please let me know if any further adjustments are needed!

lxsteiner commented 2 months ago

Hi @bdaisley that's brilliant! Thank you for following up on my suggestion, incorporating it in isoTAX and providing the short example! I've just tried it and could reproduce everything. I find this incredibly useful for my use case.

Short feedback: the input argument only accepts absolute file paths (at least on Windows with RStudio). A relative path such as "./FILE.fasta", or a bare file name in the working directory such as "FILE.fasta", does not work with how paths are currently parsed and defined in L80-83 and 126-131, and throws errors:

Error in setwd(path) : cannot change working directory

or

Error in file(file, ifelse(append, "a", "w")) : 
  cannot open the connection
In addition: Warning message:
In file(file, ifelse(append, "a", "w")) :
  cannot open file './isolateR_output/01_isoQC_mock_table.csv': No such file or directory

If there's no easy way to accept every form of path, it might help to point this out in the documentation. Anticipating all possible formulations is annoying work (I know).
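One possible user-side workaround, sketched here with base R only, is to resolve the path to an absolute one before calling isoTAX. The helper name `to_absolute` is made up for this example and is not part of isolateR, and I have not checked that this covers every case in the package's path handling.

```r
# Hypothetical helper (not part of isolateR): turn any user-supplied path
# into an absolute path before handing it to isoTAX().
to_absolute <- function(path) {
  # mustWork = TRUE stops early with a clear error if the file does not exist
  normalizePath(path, winslash = "/", mustWork = TRUE)
}

# Demonstration with a temporary FASTA file so the snippet runs on its own
old_wd <- setwd(tempdir())
writeLines(c(">seq1", "ACGT"), "FILE.fasta")
abs_path <- to_absolute("./FILE.fasta")
setwd(old_wd)

print(abs_path)
# isoTAX(input = abs_path, quick_search = TRUE)  # same call as in the walkthrough
```

`winslash = "/"` only matters on Windows, where it yields forward slashes like the `T:/...` paths used in the walkthrough.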

I'll keep on playing with the output. We have some strains that have very questionable taxonomic assignments with RefSeq, curious to see how these will be handled in isoTAX. Will get back with that eventually.

PS. I've noticed that in the documentation at https://github.com/bdaisley/isolateR?tab=readme-ov-file#step-2-isotax---assign-taxonomy all taxonomy cutoffs equal the defaults except the genus cutoff (genus_threshold=96.5), which differs from the 94.5 given in the help files. Was this on purpose just for that example, or does it reflect a preference? From within the R documentation:

Returns taxonomic classification table of class isoTAX. Default taxonomic cutoffs for phylum (75.0), class (78.5), order (82.0), family (86.5), genus (94.5), and species (98.7) demarcation are based on Yarza et al. 2014, Nature Reviews Microbiology (DOI:10.1038/nrmicro3330)
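To make the practical difference between the two genus cutoffs concrete, here is a small sketch. The cutoff values are the ones quoted above. The function `deepest_rank` and the use of `>=` for the comparison are assumptions made for illustration and say nothing about how isoTAX applies the thresholds internally.

```r
# Default cutoffs as quoted in the isoTAX help text (Yarza et al. 2014)
cutoffs <- c(phylum = 75.0, class = 78.5, order = 82.0,
             family = 86.5, genus = 94.5, species = 98.7)

# Hypothetical helper: deepest rank whose cutoff a percent identity meets
# (assumes ">=" and that cutoffs are ordered from phylum down to species)
deepest_rank <- function(identity, thresholds = cutoffs) {
  met <- names(thresholds)[identity >= thresholds]
  if (length(met) == 0) NA_character_ else met[length(met)]
}

# A hit at 95.2% identity under the help-file default and under the README value
print(deepest_rank(95.2))                                      # "genus"
print(deepest_rank(95.2, replace(cutoffs, "genus", 96.5)))     # "family"
```

Under these assumptions, a sequence at 95.2% identity would be placed at genus level with the help-file default but only at family level with the README value.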

Thanks!