Closed csdaw closed 3 years ago
This PR now adds the following functions:
- `make_crap_fasta()`, which takes a character vector of UniProt accessions and a file path. It queries UniProt for the sequences and saves them into a fasta at the specified file path (with appropriate `cRAP00X` numbers if specified).
- `append_crap_fasta()`, which takes 2 file paths: one of the fasta to add, and another of the existing cRAP fasta to append to. It adds the sequences in the first fasta to the end of the second (with appropriate `cRAP00X` numbers if specified).
- `get_ccp_crap()`, which does not take any inputs and outputs a character vector of CCP cRAP UniProt sequences.
- `download_ccp_crap()`, a wrapper function: the input is a file path, and the function downloads some sequences and saves them to the specified file path. Wraps `get_ccp_crap()`, `make_crap_fasta()`, and `append_crap_fasta()` in that order.
- `check_uniprot_release()`, which just returns the latest UniProt release as a character e.g. `"2021_01"`. I use this function when naming files I've downloaded.
- `sub_crap()`, which just adds cRAP numbers to a character vector e.g. turns `sp|XXX|YYY` into `sp|cRAP001|XXX|YYY`.
Adds the following vignette:
- `crap.Rmd` = brief discussion of contaminant databases, then shows how to use `download_ccp_crap()`, `make_crap_fasta()`, and `append_crap_fasta()`.
Brilliant. If I may continue my suggestions, `make_crap_fasta()` can be used to generate a fasta for any set of accessions, right? Would this be better called `make_fasta()`, with argument `is_crap` replacing `add_crap` to specify that the entry names should be renamed to reflect their crappy nature? Ditto `append_crap_fasta()`.
This would make it easy to add future functions to identify accessions for e.g. all human SwissProt, from which a fasta file could be generated using `make_fasta()`, which may also later need an argument to reformat fasta entries into the expected format for e.g. PD fasta parsing.
Yes, `is_crap` makes sense to me!
I don't quite understand your second point though.
Right now, your functions are written with crap proteins in mind. But with `is_crap = FALSE`, they are applicable to any set of UniProt accessions. By making that explicit in the function names and arguments, it's clearer that they can also be used if one wants to make a bespoke reference fasta, e.g. all human SwissProt proteins + transgenes etc.
Ah I see what you mean.
I'll need to make it explicit in the `make_fasta()` docs that it can only be used for so many accessions at one time. To quote UniProt:
Very large mapping requests (>50,000 identifiers) are likely to fail. Please do verify that your list does not contain any duplicates, and try to split it into smaller chunks (<20,000) in case of problems.
Also if the list of accessions is very long it will take a very long time to make the query.
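As a rough illustration of the splitting UniProt recommends, here is a minimal base R sketch that de-duplicates a vector of accessions and breaks it into chunks. The function name `chunk_accessions()` and the default `chunk_size` of 10,000 are assumptions made for this example (the default is just an arbitrary value below the 20,000 quoted above); neither is part of this PR.

```r
# Hypothetical helper (not part of this PR): de-duplicate accessions and
# split them into chunks small enough for separate UniProt queries.
chunk_accessions <- function(accessions, chunk_size = 10000L) {
  stopifnot(is.character(accessions), chunk_size >= 1)

  # UniProt asks that the list contains no duplicates
  accessions <- unique(accessions)
  if (length(accessions) == 0L) return(list())

  # Assign each accession to chunk 1, 2, 3, ... then split on that index
  chunk_index <- ceiling(seq_along(accessions) / chunk_size)
  unname(split(accessions, chunk_index))
}
```

Each element of the returned list could then be passed to `make_fasta()` in turn, with the resulting files combined afterwards.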
It would probably be good to write a function called `download_proteome(id = "UP000000xxx", isoforms = FALSE, swissprot_only = TRUE)`, which would use a different httr query to download UniProt's pre-assembled FASTA files, which exist for many (but not all) species. I have a script that could be adapted for this.
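To make the idea concrete, here is a sketch of how the three proposed arguments might map onto a query URL. It only builds the URL and does not download anything. The helper name `build_proteome_url()` is hypothetical, and the endpoint and parameters (`rest.uniprot.org/uniprotkb/stream`, `reviewed:true`, `includeIsoform=true`) are assumptions based on UniProt's current REST API, which may differ from what was available when this was discussed and should be checked against the UniProt documentation.

```r
# Hypothetical sketch: translate the proposed download_proteome() arguments
# into a UniProt query URL. Endpoint and parameter names are assumptions.
build_proteome_url <- function(id, isoforms = FALSE, swissprot_only = TRUE) {
  stopifnot(is.character(id), length(id) == 1L, grepl("^UP[0-9]+$", id))

  # Restrict to one proteome, optionally to reviewed (Swiss-Prot) entries
  query <- paste0("proteome:", id)
  if (swissprot_only) query <- paste(query, "AND reviewed:true")

  url <- paste0(
    "https://rest.uniprot.org/uniprotkb/stream?format=fasta",
    "&query=", utils::URLencode(query, reserved = TRUE)
  )

  # Optionally request isoform sequences as well
  if (isoforms) url <- paste0(url, "&includeIsoform=true")
  url
}

# Example with the human reference proteome ID
human_url <- build_proteome_url("UP000005640")
```

The resulting URL could then be fetched with something like `httr::GET(human_url, httr::write_disk(path))`.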
Ah, yes, good points. Should definitely download reference fastas where they are available, as you suggest.
For now, I think it's fine just to rename functions & arguments so the above can be merged.
Alright I think this is it.
This PR now adds the following functions:
- `make_fasta()`, which takes a character vector of UniProt accessions and a file path. It queries UniProt for the sequences and saves them into a fasta at the specified file path (with appropriate `cRAP00X` numbers if `is_crap = TRUE`).
- `append_fasta()`, which takes 2 file paths: `file1` = the fasta to append and `file2` = the fasta to append to. It adds the sequences in `file1` (with appropriate `cRAP00X` numbers if `is_crap = TRUE`) to the end of `file2`.
- `get_ccp_crap()`, which does not take any inputs and outputs a character vector of CCP cRAP UniProt sequences.
- `download_ccp_crap()`, a wrapper function: the input is a file path, and the function downloads some sequences and saves them to the specified file path. Wraps `get_ccp_crap()`, `make_fasta()`, and `append_fasta()` in that order.
- `check_uniprot_release()`, which just returns the latest UniProt release as a character e.g. `"2021_01"`. I use this function when naming files I've downloaded.
- `sub_crap()`, which just adds cRAP numbers to a character vector e.g. turns `sp|XXX|YYY` into `sp|cRAP001|XXX|YYY`.
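For readers who want a feel for the append step, here is a minimal base R sketch of the behaviour described for `append_fasta()`. It is not the implementation in this PR: the name `append_fasta_sketch()` is made up, and it assumes that every entry already in `file2` is a cRAP entry (so the next number is simply the existing header count plus one) and that `file2` ends with a newline.

```r
# Hypothetical sketch of appending file1 to file2, continuing the cRAP
# numbering from the entries already present in file2.
append_fasta_sketch <- function(file1, file2, is_crap = FALSE) {
  new_lines <- readLines(file1)
  is_header <- startsWith(new_lines, ">")

  if (is_crap && any(is_header)) {
    # Assumption: every existing header in file2 is a numbered cRAP entry
    n_existing <- sum(startsWith(readLines(file2), ">"))
    ids <- sprintf("cRAP%03d", n_existing + seq_len(sum(is_header)))

    # ">sp|XXX|YYY" becomes ">sp|cRAP00N|XXX|YYY"
    headers <- new_lines[is_header]
    new_lines[is_header] <- paste0(
      sub("\\|.*$", "", headers), "|", ids, sub("^[^|]*", "", headers)
    )
  }

  # Open file2 in append mode so existing entries are left untouched
  con <- file(file2, open = "a")
  on.exit(close(con))
  writeLines(new_lines, con)
  invisible(file2)
}
```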
Adds the following vignette:
- `crap.Rmd` = brief discussion of contaminant databases, then shows how to use `download_ccp_crap()`, `make_fasta()`, and `append_fasta()`.
Great. Nothing more from me :sweat_smile:
Adds the following functions (with appropriate tests and examples):
- `download_crap()`, a wrapper function to download different types of contaminants databases. For now the only type available is `type = "ccp"`, which will call the `download_ccp_crap()` function. Easily extendable in the future.
- `download_ccp_crap()`, an internal function that will download an up-to-date version of the CCP cRAP database from the latest UniProt release.
- `check_uniprot_release()`, which just returns the latest UniProt release as a character e.g. `"2021_01"`. I use this function when naming files I've downloaded.
- `sub_crap()`, which just adds cRAP numbers to a character vector e.g. turns `sp|XXX|YYY` into `sp|cRAP001|XXX|YYY`.
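The header renaming described for `sub_crap()` can be sketched in a few lines of base R. This is only an illustration of the `sp|XXX|YYY` to `sp|cRAP001|XXX|YYY` transformation, not the function added in this PR; the name `sub_crap_sketch()` and the `start` and `width` arguments are assumptions.

```r
# Hypothetical sketch: insert a sequential cRAP number after the first
# pipe of each FASTA entry name.
sub_crap_sketch <- function(x, start = 1L, width = 3L) {
  stopifnot(is.character(x), all(grepl("|", x, fixed = TRUE)))
  if (length(x) == 0L) return(character(0))

  # Zero-padded IDs: cRAP001, cRAP002, ...
  nums <- seq.int(as.integer(start), length.out = length(x))
  ids <- paste0("cRAP", formatC(nums, width = width, flag = "0"))

  # Everything before the first pipe, then the ID, then the remainder
  paste0(sub("\\|.*$", "", x), "|", ids, sub("^[^|]*", "", x))
}

sub_crap_sketch(c("sp|P00330|ADH1_YEAST", "sp|P02768|ALBU_HUMAN"))
```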
Adds the following vignette:
- `crap.Rmd` = brief discussion of contaminant databases, then shows how to use `download_crap()` and also how to use `Biostrings` to add your own sequences of interest to a cRAP FASTA file.
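For a rough idea of the `Biostrings` approach, here is a short sketch of adding custom sequences to an existing cRAP FASTA file. It is not taken from the vignette: the function name `add_sequences_sketch()` and its arguments are hypothetical, and it assumes the Bioconductor package `Biostrings` is installed. The new sequences are supplied as a named character vector whose names become the FASTA entry names.

```r
# Hypothetical sketch: combine an existing cRAP FASTA with extra sequences
# using Biostrings, then write the result to a new file.
add_sequences_sketch <- function(crap_file, new_seqs, out_file) {
  if (!requireNamespace("Biostrings", quietly = TRUE)) {
    stop("The Biostrings package is required for this sketch.")
  }

  # Read the existing cRAP entries as amino acid sequences
  crap <- Biostrings::readAAStringSet(crap_file)

  # Names of new_seqs are used as the entry names
  extra <- Biostrings::AAStringSet(new_seqs)

  combined <- c(crap, extra)
  Biostrings::writeXStringSet(combined, out_file)
  invisible(combined)
}
```

The entry names here are written as given, so any cRAP numbering for the new sequences would need to be included in the names of `new_seqs` beforehand.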