WormBase / genedesc_generator

Automated gene descriptions generator for model organism databases

Duplicates in protein domain sentences #61

Open rankishore opened 1 year ago

rankishore commented 1 year ago

There seems to be duplicate data in the protein domain sentences. I noticed it in WS290, and it is present in WS289 as well. Examples:

  1. WB:WBGene00001034 dnj-16 Is predicted to encode a protein with the following domains: DnaJ domain; DnaJ domain; and Chaperone J-domain superfamily.
  2. WB:WBGene00001514 xnd-1 Is predicted to encode a protein with the following domains: Phosphorylation site and Phosphorylation site.
  3. WB:WBGene00001693 grd-4 Is predicted to encode a protein with the following domains: Ground-like domain and Ground-like domain.

If this is a source file issue, then we should create a rule that eliminates the duplicate data. The duplicates also appear in other species files, not only in the C. elegans examples above.
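
A rule of that kind could look roughly like the sketch below. It is only an illustration, not the generator's actual code: the function names `dedupe_domains` and `join_domains` are hypothetical, and the joining format (" and " for two items, "; " with a final "; and " for three or more) is inferred from the example sentences above. It assumes duplicates are exact repeats apart from letter case and extra whitespace.

```python
def dedupe_domains(domains):
    """Drop repeated domain names, keeping the first occurrence and the original order.

    Hypothetical helper: names are compared after collapsing whitespace and
    ignoring case, which is an assumption about how duplicates show up.
    """
    seen = set()
    unique = []
    for name in domains:
        cleaned = " ".join(name.split())  # collapse runs of whitespace
        key = cleaned.casefold()
        if key and key not in seen:
            seen.add(key)
            unique.append(cleaned)
    return unique


def join_domains(domains):
    """Join domain names in the style seen in the example sentences (assumed format)."""
    if len(domains) <= 1:
        return "".join(domains)
    if len(domains) == 2:
        return " and ".join(domains)
    return "; ".join(domains[:-1]) + "; and " + domains[-1]


if __name__ == "__main__":
    # Domain list taken from the dnj-16 example above.
    raw = ["DnaJ domain", "DnaJ domain", "Chaperone J-domain superfamily"]
    print(join_domains(dedupe_domains(raw)))
```

Applying the rule just before the sentence is assembled would cover duplicates coming from any source file or species, at the cost of hiding the upstream problem rather than fixing it.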

rankishore commented 1 year ago

@valearna Looks like this problem has been fixed in the new file 20230707_c_elegans.txt. I don't see duplicate protein domains anymore.

valearna commented 12 months ago

@rankishore can we close this one as well?