ehsanasgari / Deep-Proteomics


Missing vectors #2

Open peter-volkov opened 7 years ago

peter-volkov commented 7 years ago

Why are the vectors for these 3-grams missing: 'HDU', 'USC', 'XLC', 'GUS', 'UKK', 'XNM', 'TGU', 'UAR', 'KBF', 'CUI', 'UGS', 'WXV', 'WCU', 'DUA', 'FXQ', 'SCU', 'XTC', 'URS', 'WRX', 'VWX', 'XGH', 'SSU', 'DKB'? They do occur in proteins from the SwissProt collection, though not very frequently: {'WXV': 4, 'VWX': 4, 'HDU': 2, 'XLC': 2, 'UAR': 2, 'KBF': 2, 'DUA': 2, 'DKB': 2, 'USC': 1, 'GUS': 1, 'UKK': 1, 'XNM': 1, 'TGU': 1, 'CUI': 1, 'UGS': 1, 'WCU': 1, 'FXQ': 1, 'SCU': 1, 'XTC': 1, 'URS': 1, 'WRX': 1, 'XGH': 1, 'SSU': 1}
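For anyone who wants to reproduce this kind of check, here is a minimal sketch of how such a list could be produced. It assumes 3-grams are counted over overlapping windows and that the pretrained vectors are available as a set of known 3-grams; the function names and the toy data are hypothetical and are not taken from this repository.

```python
from collections import Counter


def count_3grams(sequences):
    """Count overlapping 3-grams across an iterable of protein sequences."""
    counts = Counter()
    for seq in sequences:
        # Slide a window of width 3 over the sequence, one residue at a time.
        for i in range(len(seq) - 2):
            counts[seq[i:i + 3]] += 1
    return counts


def missing_3grams(sequences, vocabulary):
    """Return {3-gram: count} for 3-grams seen in the data but absent from the vocabulary."""
    counts = count_3grams(sequences)
    return {gram: n for gram, n in counts.items() if gram not in vocabulary}


# Toy data for illustration only (not real SwissProt entries).
toy_sequences = ["MHDUA", "AVWXV"]
toy_vocabulary = {"MHD", "AVW"}  # stands in for the 3-grams that have vectors

missing = missing_3grams(toy_sequences, toy_vocabulary)
```

With real data, `toy_sequences` would be replaced by the SwissProt sequences and `toy_vocabulary` by the 3-grams present in the published vector file.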

peter-volkov commented 7 years ago

Perhaps they occur only in newer SwissProt proteins that were added after your publication.

ehsanasgari commented 7 years ago

As you figured out, these are rare 3-grams, which are treated as unknown in representation learning. We are going to publish a new version that covers them too. In the meantime, one approach would be to use the vector of the most similar known 3-gram in terms of alignment score.
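The fallback suggested above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the method used in this repository: because all 3-grams have the same length, the alignment is taken to be ungapped, and a toy match/mismatch score stands in for a real substitution matrix such as BLOSUM62. All function names and the toy vectors are hypothetical.

```python
def toy_score(a, b):
    """Toy residue score: 1 for a match, 0 otherwise (stand-in for a substitution matrix)."""
    return 1 if a == b else 0


def ungapped_score(gram1, gram2, score=toy_score):
    """Sum per-position scores of two equal-length 3-grams (alignment without gaps)."""
    return sum(score(a, b) for a, b in zip(gram1, gram2))


def nearest_known_3gram(query, vocabulary, score=toy_score):
    """Return the known 3-gram with the highest score against the query."""
    # Sorting first makes tie-breaking deterministic (alphabetically first wins).
    return max(sorted(vocabulary), key=lambda gram: ungapped_score(query, gram, score))


def lookup_vector(query, vectors, score=toy_score):
    """Return the query's own vector if present, else that of its nearest known 3-gram."""
    if query in vectors:
        return vectors[query]
    return vectors[nearest_known_3gram(query, vectors, score)]


# Toy vectors for illustration only (hypothetical values).
toy_vectors = {"HDA": [0.1, 0.2], "KKK": [0.3, 0.4]}
vec = lookup_vector("HDU", toy_vectors)  # "HDU" is unknown, so this falls back to "HDA"
```

Passing a different `score` function, for example one backed by a real substitution matrix, changes which neighbour is chosen without changing the rest of the lookup.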