Orin-beep / HOTSPOT

HOTSPOT is a hierarchical host prediction tool for plasmid contigs using Transformer.

Request for Guidance on Mapping Dictionary Implementation in Your Tool #5

Open zzzfire opened 1 week ago

zzzfire commented 1 week ago

Hello, Author!

Your tool is incredibly effective, and I truly appreciate your work. However, I need your assistance with a specific issue.

In the database section, your model can map and convert the extracted pc, mob, mpf, and inc into encoded formats. I’d like to ask:

How did you construct the mapping dictionary? Would it be possible for you to share the relevant code? I would greatly appreciate your help! Thank you in advance for your time and support.

Orin-beep commented 1 week ago

Hi there. Thanks for your question and for using our tool!

The code that constructs the encoded features (PC, MOB, MPF, and Inc) is in lines 120 to 232 of the "preprocessing.py" script.

Specifically, the final features of each plasmid are encoded into a vector of length 472: the first 400 elements hold the PC tokens, the next 50 elements hold the MOB/MPF tokens, and the last 22 elements hold the Inc encoding in one-hot format.
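To make the layout concrete, here is a minimal sketch of how the three segments could be joined into one 472-element vector. It is not the code from "preprocessing.py": the function names `pad_tokens` and `build_feature_vector` are hypothetical, and right-padding with 0 (and truncating overly long token lists) is an assumption, since the padding scheme is not described here.

```python
# Segment lengths taken from the description above: 400 + 50 + 22 = 472.
PC_LEN, MOB_MPF_LEN, INC_LEN = 400, 50, 22


def pad_tokens(tokens, length, pad_value=0):
    """Truncate or right-pad a token list to a fixed length.

    Assumption: 0 is the padding value; the real script may differ.
    """
    tokens = list(tokens)[:length]
    return tokens + [pad_value] * (length - len(tokens))


def build_feature_vector(pc_tokens, mob_mpf_tokens, inc_one_hot):
    """Concatenate the PC, MOB/MPF, and Inc segments (hypothetical helper)."""
    if len(inc_one_hot) != INC_LEN:
        raise ValueError(f"Inc encoding must have {INC_LEN} elements")
    return (
        pad_tokens(pc_tokens, PC_LEN)
        + pad_tokens(mob_mpf_tokens, MOB_MPF_LEN)
        + list(inc_one_hot)
    )


# Illustrative inputs only; the PC indices here are made up.
vec = build_feature_vector([17, 3, 256], [1, 11], [1] + [0] * 21)
print(len(vec))  # 472
```

With this layout, positions 0 to 399 belong to the PC tokens, 400 to 449 to the MOB/MPF tokens, and 450 to 471 to the Inc one-hot encoding.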

The PC tokens are determined by the protein cluster index. The mapping for MOB/MPF tokens is: {'MOBB':4, 'MOBQ':7, 'MOBP':8, 'MOBM':5, 'MOBF':1, 'MOBT':2, 'MOBC':3, 'MOBH':6, 'MOBV':9, 'MPF_G':10, 'MPF_T':11, 'MPF_F':12, 'MPF_I':13}
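As an illustration, the mapping above could be applied as in the sketch below. The dictionary is copied from this reply; everything else is an assumption for demonstration: the helper name `encode_mob_mpf` is hypothetical, unknown labels are simply skipped, and 0 is used for padding because the mapping itself only uses the values 1 to 13.

```python
# Mapping copied from the reply above.
MOB_MPF_TOKENS = {
    'MOBB': 4, 'MOBQ': 7, 'MOBP': 8, 'MOBM': 5, 'MOBF': 1, 'MOBT': 2,
    'MOBC': 3, 'MOBH': 6, 'MOBV': 9,
    'MPF_G': 10, 'MPF_T': 11, 'MPF_F': 12, 'MPF_I': 13,
}


def encode_mob_mpf(labels, length=50, pad_value=0):
    """Map MOB/MPF labels to tokens and pad to a fixed length.

    Assumptions: labels missing from the mapping are dropped, and the
    token list is right-padded with 0 (the real script may differ).
    """
    tokens = [MOB_MPF_TOKENS[label] for label in labels if label in MOB_MPF_TOKENS]
    tokens = tokens[:length]
    return tokens + [pad_value] * (length - len(tokens))


segment = encode_mob_mpf(['MOBF', 'MPF_T'])
print(segment[:4])  # [1, 11, 0, 0]
```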

The Inc one-hot encoding uses the following index order: ['IncA/C', 'IncY', 'IncP', 'IncHI1', 'Inc4', 'IncHI2', 'IncI/B/O/K/Z', 'IncT', 'IncQ', 'Inc11', 'Inc13', 'IncN', 'IncFIC', 'FII', 'IncU', 'IncW', 'IncX', 'IncL/M', 'IncR', 'FIA', 'IncFIB', 'Inc18']. For example, if a plasmid has only the Inc group IncA/C, then its Inc encoding is [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0].
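The Inc encoding can be sketched in a few lines. The group list is copied from this reply; the function name `encode_inc` is hypothetical, and setting one position per detected group (so a plasmid with several Inc groups gets several 1s) is an assumption about how multiple groups are handled.

```python
# Index order copied from the reply above (22 groups).
INC_GROUPS = [
    'IncA/C', 'IncY', 'IncP', 'IncHI1', 'Inc4', 'IncHI2', 'IncI/B/O/K/Z',
    'IncT', 'IncQ', 'Inc11', 'Inc13', 'IncN', 'IncFIC', 'FII', 'IncU',
    'IncW', 'IncX', 'IncL/M', 'IncR', 'FIA', 'IncFIB', 'Inc18',
]


def encode_inc(groups):
    """Return a 22-element 0/1 vector marking which Inc groups are present."""
    present = set(groups)
    return [1 if name in present else 0 for name in INC_GROUPS]


encoding = encode_inc(['IncA/C'])
print(encoding[:3], sum(encoding))  # [1, 0, 0] 1
```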

Feel free to reach out if you have any other questions!