CambridgeCentreForProteomics / camprotR

https://cambridgecentreforproteomics.github.io/camprotR/
MIT License

Download cRAP #21

Closed csdaw closed 3 years ago

csdaw commented 3 years ago

Adds the following functions (with appropriate tests and examples):

Adds the following vignette:

csdaw commented 3 years ago

This PR now adds the following functions:

Adds the following vignette:

TomSmithCGAT commented 3 years ago

Brilliant. If I may continue my suggestions: make_crap_fasta() can be used to generate a FASTA for any set of accessions, right? Would it be better called make_fasta(), with an argument is_crap replacing add_crap to specify that the entry names should be renamed to reflect their crappy nature? Ditto append_crap_fasta().

This would make it easy to add future functions that identify accessions for, e.g., all human SwissProt, from which a FASTA file could be generated using make_fasta(). That function may also later need an argument to reformat FASTA entries into the expected format for, e.g., PD FASTA parsing.

csdaw commented 3 years ago

Yes, is_crap makes sense to me!

I don't quite understand your second point though.

TomSmithCGAT commented 3 years ago

Right now, your functions are written with cRAP proteins in mind. But with is_crap=F, they are applicable to any set of UniProt accessions. Making that explicit in the function names and arguments makes it clearer that they can also be used to build a bespoke reference FASTA, e.g. all human SwissProt proteins + transgenes etc.
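
To make the is_crap idea concrete, here is a minimal sketch of the header renaming it could control. The helper name rename_headers() is hypothetical, and the "cRAP001|" tag format is an assumption for illustration, not necessarily what camprotR writes.

```r
# Hypothetical helper (not part of camprotR): optionally retag UniProt-style
# FASTA headers so contaminant entries are easy to recognise downstream.
# ASSUMPTION: the tag format "cRAP001|..." is illustrative only.
rename_headers <- function(headers, is_crap = FALSE) {
  stopifnot(is.character(headers))
  # With is_crap = FALSE the headers are returned untouched, so the same
  # code path works for any set of UniProt accessions.
  if (!is_crap) return(headers)
  # One zero-padded tag per entry: cRAP001, cRAP002, ...
  tags <- sprintf("cRAP%03d", seq_along(headers))
  # Drop the leading ">sp|" or ">tr|" database field and prepend the tag.
  paste0(">", tags, "|", sub("^>(sp|tr)\\|", "", headers))
}

headers <- c(
  ">sp|P00330|ADH1_YEAST Alcohol dehydrogenase 1",
  ">sp|P02768|ALBU_HUMAN Albumin"
)
renamed <- rename_headers(headers, is_crap = TRUE)
unchanged <- rename_headers(headers, is_crap = FALSE)
```

Keeping the renaming behind a single logical argument means a general make_fasta() needs no separate cRAP-specific variant.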

csdaw commented 3 years ago

Ah I see what you mean.

I'll need to make it explicit in the make_fasta() docs that it can only be used for a limited number of accessions at one time. To quote UniProt:

Very large mapping requests (>50,000 identifiers) are likely to fail. Please do verify that your list does not contain any duplicates, and try to split it into smaller chunks (<20,000) in case of problems.

Also, if the list of accessions is very long, the query will take a very long time.
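
A rough sketch of the de-duplicate-and-chunk step that UniProt recommends, in base R. The helper name chunk_accessions() and the default chunk size are assumptions for illustration; neither is part of camprotR.

```r
# Hypothetical helper (not part of camprotR): remove duplicate accessions,
# then split them into chunks of at most `chunk_size` for separate queries.
# ASSUMPTION: 10000 is an arbitrary default chosen to stay well under the
# "<20,000" figure quoted from UniProt.
chunk_accessions <- function(accessions, chunk_size = 10000) {
  stopifnot(is.character(accessions), chunk_size >= 1)
  accessions <- unique(accessions)
  if (length(accessions) == 0) return(list())
  # ceiling(i / chunk_size) is 1 for the first chunk_size items, 2 for the
  # next chunk_size items, and so on.
  groups <- ceiling(seq_along(accessions) / chunk_size)
  unname(split(accessions, groups))
}

# Small demonstration with a tiny chunk size; "P00330" appears twice.
chunks <- chunk_accessions(
  c("P00330", "P02768", "P00330", "P00761", "P02769"),
  chunk_size = 2
)
```

Each element of the returned list could then be passed to a separate query and the resulting FASTA entries concatenated.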

It would probably be good to write a function called download_proteome(id = "UP000000xxx", isoforms = FALSE, swissprot_only = TRUE), which would use a different httr query to download UniProt's pre-assembled FASTA files. These exist for many (but not all) species. I have a script that could be adapted for this.
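
As a rough illustration of what such a function might build internally, the sketch below only constructs a download URL and does not contact UniProt. Everything here is an assumption to be checked against UniProt's documentation: the helper name proteome_fasta_url(), the rest.uniprot.org stream endpoint, the reviewed:true query field, and the includeIsoform parameter reflect the current UniProt REST API as I understand it, not the script mentioned above.

```r
# Hypothetical helper (not part of camprotR): build a URL for downloading a
# reference proteome as FASTA.
# ASSUMPTION: endpoint and parameter names follow the current UniProt REST
# API and should be verified before use.
proteome_fasta_url <- function(id, isoforms = FALSE, swissprot_only = TRUE) {
  stopifnot(is.character(id), length(id) == 1)
  query <- paste0("proteome:", id)
  # Restrict to reviewed (SwissProt) entries if requested.
  if (swissprot_only) query <- paste(query, "AND reviewed:true")
  url <- paste0(
    "https://rest.uniprot.org/uniprotkb/stream?format=fasta&query=",
    utils::URLencode(query, reserved = TRUE)
  )
  # Optionally ask for isoform sequences as well.
  if (isoforms) url <- paste0(url, "&includeIsoform=true")
  url
}

# UP000005640 (the human reference proteome) is used purely as an example ID.
url <- proteome_fasta_url("UP000005640")
```

The resulting URL could then be fetched with, for example, httr::GET(url, httr::write_disk("proteome.fasta")), which avoids per-accession queries altogether.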

TomSmithCGAT commented 3 years ago

Ah, yes, good points. Should definitely download reference fastas where they are available, as you suggest.

For now, I think it's fine just to rename functions & arguments so the above can be merged.

csdaw commented 3 years ago

Alright I think this is it.

This PR now adds the following functions:

Adds the following vignette:

TomSmithCGAT commented 3 years ago

Great. Nothing more from me :sweat_smile: