clulab / reach

Reach Biomedical Information Extraction

Remove duplicate entries from Bioresources #746

Closed enoriega closed 3 years ago

enoriega commented 3 years ago

I am about to remove the duplicate entry for E3:

[['E3', '5756', '', 'pubchem', 'Simple_chemical'],
 ['E3', 'E3_Ub_ligase', '', 'fplx', 'Family']]

@MihaiSurdeanu Which one is the correct entry? I also found more entities in NER-Grounding-Override.tsv whose duplicate entries are identical. I will remove the redundant ones.

While I am at it, should I look for duplicate entries in the other KB files?
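For reference, the two kinds of duplicates mentioned above (identical repeats versus conflicting entries such as E3) could be told apart with a short script. This is only a sketch under assumptions: it treats the file as tab-separated with the entity string in the first column, as the E3 rows suggest, and the `find_duplicates` helper and the `dup_example` rows are made up for illustration; they are not part of Reach.

```python
import csv
import io
from collections import defaultdict


def find_duplicates(tsv_text):
    """Group rows by entity string (assumed to be column 0) and split the
    repeated ones into identical repeats and conflicting entries."""
    rows_by_key = defaultdict(list)
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank lines and comment lines (comment syntax is an assumption).
        if not row or row[0].startswith("#"):
            continue
        rows_by_key[row[0]].append(row)

    identical, conflicting = {}, {}
    for key, rows in rows_by_key.items():
        if len(rows) < 2:
            continue
        # One distinct tuple means every repeat is a verbatim copy.
        if len({tuple(r) for r in rows}) == 1:
            identical[key] = rows
        else:
            conflicting[key] = rows
    return identical, conflicting


# The E3 rows come from this issue; the dup_example rows are hypothetical.
sample = (
    "E3\t5756\t\tpubchem\tSimple_chemical\n"
    "E3\tE3_Ub_ligase\t\tfplx\tFamily\n"
    "dup_example\tID1\t\tns\tType\n"
    "dup_example\tID1\t\tns\tType\n"
)
identical, conflicting = find_duplicates(sample)
print("identical:", sorted(identical))
print("conflicting:", sorted(conflicting))
```

Identical repeats are safe to collapse automatically, while conflicting entries like E3 need someone to pick the right grounding.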

MihaiSurdeanu commented 3 years ago

Not sure. @bgyori ?

You should check the override KB for duplicates. Probably not the other ones. Thanks!

kwalcock commented 3 years ago

See also the list at the bottom of #742 .

enoriega commented 3 years ago

Thanks @kwalcock . I am not sure how to deal with the other duplicates. Consider axin: it appears in uniprot, but has a manual override. Does it make sense to remove it from uniprot? I don't think it does.

bgyori commented 3 years ago

Duplicates in the overrides: interesting. I don't think having duplicates in there makes sense, so we should probably remove those. If possible, I'd like to take a look at the choices to see if they make sense. As for the other files, duplicates at the level of the entity string are normal ambiguities that are to be expected, so we shouldn't remove them.
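A cleanup along these lines might drop only verbatim repeats from the override file and leave any entity string that still has more than one grounding for manual review. The sketch below assumes rows are already parsed into lists with the entity string in column 0; `drop_exact_duplicates` and `needs_review` are hypothetical helper names, and the third row is a made-up repeat added for illustration.

```python
def drop_exact_duplicates(rows):
    """Remove rows that repeat an earlier row exactly, preserving order."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            kept.append(list(row))
    return kept


def needs_review(rows):
    """Return entity strings (column 0, assumed) that still have more than one row."""
    counts = {}
    for row in rows:
        counts[row[0]] = counts.get(row[0], 0) + 1
    return sorted(name for name, n in counts.items() if n > 1)


override_rows = [
    ["E3", "5756", "", "pubchem", "Simple_chemical"],
    ["E3", "E3_Ub_ligase", "", "fplx", "Family"],
    ["E3", "E3_Ub_ligase", "", "fplx", "Family"],  # hypothetical exact repeat
]
cleaned = drop_exact_duplicates(override_rows)
print(len(cleaned))
print(needs_review(cleaned))
```

This would apply to the override file only; in the other KB files the same entity string appearing with different groundings is expected ambiguity and would be left alone.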

kwalcock commented 3 years ago

I think that they have duplicates within the overrides because I stopped adding to the list before it got to processing the regular KBs. As usual, I may be mistaken.

bgyori commented 3 years ago

Actually, I could work on eliminating the override duplicates and push here, shall I do that?

MihaiSurdeanu commented 3 years ago

Please!

On Tue, Apr 6, 2021 at 2:24 PM Benjamin M. Gyori @.***> wrote:

Actually, I could work on eliminating the override duplicates and push here, shall I do that?
