RNAcentral / rnacentral-webcode

RNAcentral website source code
https://rnacentral.org
Apache License 2.0

Detect possible protein coding sequences #116

Open blakesweeney opened 7 years ago

blakesweeney commented 7 years ago

There appear to be protein-coding sequences in RNAcentral. For example:

http://rnacentral.org/search?q=%22protein%22%20and%20not%20%22non-protein%22
http://rnacentral.org/search?q=hypothetical%20protein
http://rnacentral.org/search?q=open%20reading%20frame

Not all of the results from these searches are really protein coding. Some of this may just be a naming issue, such as:

http://rnacentral.org/rna/URS000075B8D6/9606
http://rnacentral.org/rna/URS000075CAAA/9606
http://rnacentral.org/rna/URS000079768F/3702

These have alternative names that do not suggest protein coding, and we should prefer those over the other names. Other examples include:

http://rnacentral.org/rna/URS000075D487/9606

where the title at RefSeq shows: "Homo sapiens long intergenic non-protein coding RNA 1940 (LINC01940), long non-coding RNA" yet the one we see is: "Homo sapiens FLJ43879 protein (FLJ43879), long non-coding RNA." The first is a better name for our purposes.

However, not all hits are just naming problems:

http://rnacentral.org/rna/URS0000A77706/10116
http://rnacentral.org/rna/URS000075D487/9606

This issue is likely a mix of issues:

  1. Importing sequences that are protein coding, which we should not import.
  2. Selecting the wrong name for a sequence.
  3. Needing to create a better name than the given ones.
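
To make point 2 concrete, here is a minimal sketch of how a name could be chosen from several candidates. It is not how RNAcentral currently selects names. The function names `looks_protein_coding` and `pick_name` and the keyword patterns are assumptions for illustration, based only on the search phrases listed above.

```python
import re

# Assumed patterns, taken from the search phrases in this issue.
SUSPICIOUS = re.compile(r"\bprotein\b|open reading frame", re.IGNORECASE)
NON_CODING = re.compile(r"non-protein", re.IGNORECASE)


def looks_protein_coding(name):
    """Return True if a name mentions protein outside of 'non-protein'."""
    # Drop 'non-protein' first so that "non-protein coding" is not flagged.
    cleaned = NON_CODING.sub("", name)
    return bool(SUSPICIOUS.search(cleaned))


def pick_name(names):
    """Prefer the first name that does not look protein coding.

    Falls back to the first candidate if every name is flagged.
    """
    if not names:
        raise ValueError("at least one candidate name is required")
    for name in names:
        if not looks_protein_coding(name):
            return name
    return names[0]


candidates = [
    "Homo sapiens FLJ43879 protein (FLJ43879), long non-coding RNA.",
    "Homo sapiens long intergenic non-protein coding RNA 1940 (LINC01940), "
    "long non-coding RNA",
]
print(pick_name(candidates))
```

With the two URS000075D487 titles quoted above, this picks the LINC01940 name. A keyword filter like this only addresses naming. It cannot tell whether a sequence is actually protein coding.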

We should probably do something about this issue, but this may be longer-term work. I am putting this here for tracking and discussion.

More examples in #73.

AntonPetrov commented 6 years ago

Suggestions from the Consortium meeting:

blakesweeney commented 6 years ago

There were also suggestions about computing coding potential for sequences.