Download cRAP - Githubissues

csdaw commented 3 years ago

Adds the following functions (with appropriate tests and examples):

download_crap(), a wrapper function that downloads different types of contaminant databases. For now the only type available is type = "ccp", which calls download_ccp_crap(). Easily extendable in the future.
download_ccp_crap(), an internal function that downloads an up-to-date version of the CCP cRAP database from the latest UniProt release.
check_uniprot_release(), which just returns the latest UniProt release as a character string, e.g. "2021_01". I use this function when naming files I've downloaded.
sub_crap(), which just adds cRAP numbers to a character vector, e.g. turns sp|XXX|YYY into sp|cRAP001|XXX|YYY.
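
The renaming that sub_crap() performs can be sketched as follows. This is a minimal illustration in Python (camprotR itself is an R package), not the PR's implementation: the `start` and `width` parameters, the optional leading `>`, and the example accessions are assumptions made for the sketch.

```python
import re

# Matches the database prefix of a UniProt-style header, with an optional ">".
PREFIX = re.compile(r"^(>?)(sp|tr)\|")


def sub_crap(headers, start=1, width=3):
    """Insert a sequential cRAP number after the sp|/tr| prefix of each header.

    Headers that do not start with sp| or tr| are returned unchanged,
    but still consume a number (an assumption of this sketch).
    """
    renamed = []
    for number, header in enumerate(headers, start=start):
        tag = f"cRAP{number:0{width}d}"
        renamed.append(
            PREFIX.sub(lambda m: f"{m.group(1)}{m.group(2)}|{tag}|", header, count=1)
        )
    return renamed


if __name__ == "__main__":
    for line in sub_crap(["sp|P00330|ADH1_YEAST", "sp|P02768|ALBU_HUMAN"]):
        print(line)
```

Numbering by position keeps the cRAP tags stable for a given input order, which is why the order of the accession vector matters.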

Adds the following vignette:

crap.Rmd = brief discussion of contaminant databases, then shows how to use download_crap() and also how to use Biostrings to add your own sequences of interest to a cRAP FASTA file.

csdaw commented 3 years ago

This PR now adds the following functions:

make_crap_fasta(), which takes a character vector of UniProt accessions and a file path. It queries UniProt for the sequences and saves them into a fasta at the specified file path (with appropriate cRAP00X numbers if specified).
append_crap_fasta(), which takes 2 file paths: one of the fasta to add, and another of the existing cRAP fasta to append to. It adds the sequences in the first fasta to the end of the second (with appropriate cRAP00X numbers if specified).
get_ccp_crap(), which does not take any inputs and outputs a character vector of CCP cRAP UniProt sequences.
download_ccp_crap(), a wrapper function whose input is a file path. It downloads the CCP cRAP sequences and saves them to the specified file path. Wraps get_ccp_crap(), make_crap_fasta(), and append_crap_fasta() in that order.
check_uniprot_release(), which just returns the latest UniProt release as a character string, e.g. "2021_01". I use this function when naming files I've downloaded.
sub_crap(), which just adds cRAP numbers to a character vector, e.g. turns sp|XXX|YYY into sp|cRAP001|XXX|YYY.

Adds the following vignette:

crap.Rmd = brief discussion of contaminant databases, then shows how to use download_ccp_crap(), make_crap_fasta(), and append_crap_fasta().

TomSmithCGAT commented 3 years ago

Brilliant. If I may continue my suggestions, make_crap_fasta() can be used to generate a fasta for any set of accessions, right? Would this be better called make_fasta(), with an argument is_crap replacing add_crap to specify that the entry names should be renamed to reflect their crappy nature? Ditto append_crap_fasta().

This would make it easy to add future functions to identify accessions for e.g. all human SwissProt, from which a fasta file could be generated using make_fasta(), which may also later need an argument to reformat fasta entries into the expected format for e.g. PD fasta parsing.

csdaw commented 3 years ago

Yes, is_crap makes sense to me!

I don't quite understand your second point though.

TomSmithCGAT commented 3 years ago

Right now, your functions are written with crap proteins in mind. But with is_crap=F, they are applicable to any set of UniProt accessions. By making that explicit in the function names and arguments, it's clearer that they can also be used if one wants to make a bespoke reference fasta, e.g. all human SwissProt proteins + transgenes etc.

csdaw commented 3 years ago

Ah I see what you mean.

I'll need to make it explicit in the make_fasta() docs that it can only be used for so many accessions at one time. To quote UniProt:

Very large mapping requests (>50,000 identifiers) are likely to fail. Please do verify that your list does not contain any duplicates, and try to split it into smaller chunks (<20,000) in case of problems.

Also, if the list of accessions is very long, the query will take a very long time.
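
UniProt's advice above (remove duplicates, split into chunks under 20,000) could be applied before any query is made. A minimal sketch in Python, used only for illustration since camprotR is an R package; `chunk_accessions` and its default `chunk_size` are hypothetical and not part of this PR.

```python
def chunk_accessions(accessions, chunk_size=10_000):
    """Drop duplicate accessions (keeping first occurrences, in order)
    and split the remainder into lists of at most chunk_size items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    # dict.fromkeys preserves insertion order, so this de-duplicates stably.
    unique = list(dict.fromkeys(accessions))
    return [unique[i:i + chunk_size] for i in range(0, len(unique), chunk_size)]


if __name__ == "__main__":
    # Each chunk would then be sent as its own query.
    for chunk in chunk_accessions(["P00330", "P02768", "P00330", "P00761"], chunk_size=2):
        print(chunk)
```

Each chunk would be queried separately and the results concatenated, which keeps individual requests well below the size at which UniProt says they are likely to fail.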

It would probably be good to write a function called download_proteome(id = "UP000000xxx", isoforms = FALSE, swissprot_only = TRUE), which would use a different httr query to download UniProt's pre-assembled FASTA files. These exist for many (but not all) species. I have a script that could be adapted for this.

TomSmithCGAT commented 3 years ago

Ah, yes, good points. Should definitely download reference fastas where they are available, as you suggest.

For now, I think it's fine just to rename functions & arguments so the above can be merged.

csdaw commented 3 years ago

Alright I think this is it.

This PR now adds the following functions:

make_fasta(), which takes a character vector of UniProt accessions and a file path. It queries UniProt for the sequences and saves them into a fasta at the specified file path (with appropriate cRAP00X numbers if is_crap = TRUE).
append_fasta(), which takes 2 file paths: file1 = the fasta to append and file2 = the fasta to append to. It adds the sequences in file1 (with appropriate cRAP00X numbers if is_crap = TRUE) to the end of file2.
get_ccp_crap(), which does not take any inputs and outputs a character vector of CCP cRAP UniProt sequences.
download_ccp_crap(), a wrapper function whose input is a file path. It downloads the CCP cRAP sequences and saves them to the specified file path. Wraps get_ccp_crap(), make_fasta(), and append_fasta() in that order.
check_uniprot_release(), which just returns the latest UniProt release as a character string, e.g. "2021_01". I use this function when naming files I've downloaded.
sub_crap(), which just adds cRAP numbers to a character vector, e.g. turns sp|XXX|YYY into sp|cRAP001|XXX|YYY.
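
The append step described for append_fasta() can be sketched like this. It is an illustration in Python rather than the package's R code, and it rests on one labelled assumption: that numbering continues from the highest cRAP number already present in file2. The thread does not say how the real function chooses the numbers.

```python
import re

CRAP_TAG = re.compile(r"\|cRAP(\d+)\|")   # existing cRAP numbers in a header
HEADER = re.compile(r"^>(sp|tr)\|")       # UniProt-style FASTA header prefix


def append_fasta(file1, file2, is_crap=False):
    """Append the records of file1 to the end of file2.

    If is_crap is True, headers from file1 receive cRAP numbers that
    continue after the highest number found in file2 (an assumption).
    """
    with open(file2) as fh:
        existing = fh.read()
    numbers = [int(n) for n in CRAP_TAG.findall(existing)]
    next_number = max(numbers, default=0) + 1

    out_lines = []
    with open(file1) as fh:
        for line in fh:
            line = line.rstrip("\n")
            if is_crap and HEADER.match(line):
                tag = f"cRAP{next_number:03d}"
                line = HEADER.sub(lambda m: f">{m.group(1)}|{tag}|", line, count=1)
                next_number += 1
            out_lines.append(line)

    if not out_lines:
        return
    with open(file2, "a") as fh:
        # Make sure the first appended header starts on its own line.
        if existing and not existing.endswith("\n"):
            fh.write("\n")
        fh.write("\n".join(out_lines) + "\n")
```

With is_crap left as False the function is a plain concatenation, which matches the point made earlier in the thread that the same helpers can build any bespoke reference fasta.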

Adds the following vignette:

crap.Rmd = brief discussion of contaminant databases, then shows how to use download_ccp_crap(), make_fasta(), and append_fasta().

TomSmithCGAT commented 3 years ago

Great. Nothing more from me :sweat_smile:

CambridgeCentreForProteomics / camprotR

Download cRAP #21