Rfam / rfam-production

Rfam production pipeline
Apache License 2.0
Add a new tool for replacing accessions in SEED using NCBI BLAST #71

Open AntonPetrov opened 3 years ago

AntonPetrov commented 3 years ago

If a SEED alignment contains sequences with accessions not found in GenBank or RNAcentral (for example, AGAIN_GXP7IEG01DU42M/31-242), we cannot use such an alignment to build an Rfam family. The idea is to search all sequences with "bad" accessions using NCBI BLAST and find closely matching sequences with valid accessions (for example, CP061297.1/724751-724962).

There does not seem to be an NCBI BLAST API but it's possible to submit multiple sequences using a web interface and then download the results in JSON format using a job id.

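I propose the following approach:

1. convert a user-submitted SEED alignment to a fasta file
2. build a CM based on the SEED
3. manually search NCBI web BLAST and download results in JSON format
4. the new program will automatically pick for each sequence in the fasta file a single "best" hit based on the combination of coverage, identity, scores etc., or skip the sequence if no good results are found
5. use cmalign to align the "best" selected sequences to the CM
6. use the resulting alignment as a new SEED

The curator can start a BLAST search manually on the NCBI website, get a BLAST-job-id, and then run rfblast BLAST-job-id in a folder with an old SEED to get a new SEED.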
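Before a BLAST search can be started, the sequences in the old SEED have to be available as unaligned FASTA. Below is a minimal sketch of that conversion, assuming the SEED is a Stockholm file and that gaps are written as `.`, `-`, `_` or `~`. The function name `stockholm_to_fasta` is hypothetical and is not part of rfam-production.

```python
import re


def stockholm_to_fasta(stockholm_text):
    """Turn a Stockholm alignment into unaligned FASTA text.

    Assumption: every sequence line has exactly two whitespace-separated
    fields (name, aligned sequence). Markup lines starting with '#' and
    the '//' terminator are ignored.
    """
    sequences = {}  # name -> aligned sequence, in order of first appearance
    for line in stockholm_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or line == "//":
            continue
        parts = line.split()
        if len(parts) != 2:
            continue  # not a plain sequence line, skip it
        name, aligned = parts
        # Concatenate so that interleaved (multi-block) files also work.
        sequences[name] = sequences.get(name, "") + aligned

    records = []
    for name, aligned in sequences.items():
        ungapped = re.sub(r"[.\-_~]", "", aligned)  # drop gap characters
        records.append(f">{name}\n{ungapped}")
    return "\n".join(records) + "\n"
```

The resulting text can be pasted into the NCBI web form. The original names (including the "bad" accessions) are kept as FASTA headers so that each BLAST result can be traced back to its SEED sequence.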

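Additional info:

- there is an EBI BLAST+ with a web interface but it seems to be much slower
- looks like NCBI BLAST used to have a web API https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=DeveloperInfo but it's been discontinued https://ncbi.github.io/blast-cloud/dev/api.html in favour of self-hosted cloud installations.

I can start putting rfblast together (in Python of course 🐍 ) with some help from @emmaco. We may ask @nawrockie for assistance when needed.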

@nancyontiveros - hopefully this will help build Ken families but such a tool will be useful for general curation and Rfam Cloud as well.

nancyontiveros commented 3 years ago

This looks great, finally a way to convert the bad IDs to valid IDs.

However, I see that many BLAST hits are not 100% identical to the query, which means that we can't use those BLAST results.

I figure that in those cases we can drop the affected sequences, which means a SEED with fewer sequences.

We may need to test it, but hopefully 🤞 this fixes the problem of the SEEDs with bad IDs.

Thank you, Anton and Emma

Nancy

On 12 Aug 2021, at 09:56, Anton Petrov wrote:

If a SEED alignment contains sequences with accessions not found in GenBank or RNAcentral (for example, AGAIN_GXP7IEG01DU42M/31-242) we cannot use such an alignment to build an Rfam family. The idea is to search all sequences with "bad" accessions using NCBI BLAST and find closely matching sequences with valid accessions (CP061297.1/724751-724962).

There does not seem to be an NCBI BLAST API but it's possible to submit multiple sequences using a web interface and then download the results in JSON format using a job id.

I propose the following approach:

1. convert a user-submitted SEED alignment to a fasta file
2. build a CM based on the SEED
3. manually search NCBI web BLAST and download results in JSON format
4. the new program will automatically pick for each sequence in the fasta file a single "best" hit based on the combination of coverage, identity, scores etc. Or skip if no good results are found.
5. use cmalign to align the "best" selected sequences to the CM
6. use the resulting alignment as a new SEED

The curator can start a BLAST search manually on the NCBI website, get a BLAST-job-id, and then run rfblast BLAST-job-id in a folder with an old SEED to get a new SEED.

Additional info:

- there is an EBI BLAST+ with a web interface but it seems to be much slower
- looks like NCBI BLAST used to have a web API https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=DeveloperInfo but it's been discontinued https://ncbi.github.io/blast-cloud/dev/api.html in favour of self-hosted cloud installations.

I can start putting rfblast together (in Python of course 🐍 ) with some help from @emmaco https://github.com/emmaco. We may ask @nawrockie https://github.com/nawrockie for assistance when needed.

@nancyontiveros https://github.com/nancyontiveros - hopefully this will help build Ken families but such a tool will be useful for general curation and Rfam Cloud as well.

AntonPetrov commented 3 years ago

The sequences don't always have to be 100% identical, as long as they are "close enough". We can decide on the thresholds and make them adjustable through command-line options. But you are right, @nancyontiveros: sometimes there will be no good hits and those sequences will have to be removed from the SEED 😞
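To make the threshold idea concrete, here is a rough sketch of how the "best hit or skip" decision could look. Everything in it is an assumption for illustration: the `Hit` fields, the option names `--min-identity` and `--min-coverage`, the 95% defaults, and the ranking order (identity, then coverage, then bit score). Parsing the downloaded BLAST JSON into `Hit` objects is deliberately left out.

```python
import argparse
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Hit:
    """One BLAST hit for a SEED sequence (hypothetical, simplified fields)."""
    accession: str
    start: int
    end: int
    identity: float   # percent identity over the aligned region
    coverage: float   # percent of the query covered by the hit
    bit_score: float

    def seed_name(self) -> str:
        # Same accession/start-end style as CP061297.1/724751-724962.
        return f"{self.accession}/{self.start}-{self.end}"


def pick_best_hit(hits: Sequence[Hit], min_identity: float,
                  min_coverage: float) -> Optional[Hit]:
    """Return the best hit that passes both thresholds, or None to skip."""
    candidates = [h for h in hits
                  if h.identity >= min_identity and h.coverage >= min_coverage]
    if not candidates:
        return None  # no good hit: the sequence is dropped from the SEED
    # Assumed ranking: identity first, then coverage, then bit score.
    return max(candidates, key=lambda h: (h.identity, h.coverage, h.bit_score))


def parse_args(argv=None):
    """Command-line options; names and defaults are placeholders."""
    parser = argparse.ArgumentParser(prog="rfblast")
    parser.add_argument("job_id", help="BLAST job id from the NCBI website")
    parser.add_argument("--min-identity", type=float, default=95.0,
                        help="minimum percent identity for a usable hit")
    parser.add_argument("--min-coverage", type=float, default=95.0,
                        help="minimum percent query coverage for a usable hit")
    return parser.parse_args(argv)
```

With this shape, a curator could loosen the cut-offs for a difficult family, for example `rfblast BLAST-job-id --min-identity 90`, and sequences for which `pick_best_hit` returns `None` would be the ones removed from the new SEED.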