Open zzzfire opened 1 week ago
Hi there. Thanks for your question and for using our tool!
The code that constructs the encoded features (PC, MOB, MPF, and Inc) is on lines 120 to 232 of the "preprocessing.py" script.
Specifically, the final features for each plasmid are encoded into a vector of length 472: the first 400 elements hold the PC tokens, the next 50 elements hold the MOB/MPF tokens, and the last 22 elements hold the Inc encoding in one-hot format.
The PC tokens are determined by the protein cluster index. The mapping for MOB/MPF tokens is: {'MOBB':4, 'MOBQ':7, 'MOBP':8, 'MOBM':5, 'MOBF':1, 'MOBT':2, 'MOBC':3, 'MOBH':6, 'MOBV':9, 'MPF_G':10, 'MPF_T':11, 'MPF_F':12, 'MPF_I':13}
The index order for the Inc one-hot encoding is ['IncA/C', 'IncY', 'IncP', 'IncHI1', 'Inc4', 'IncHI2', 'IncI/B/O/K/Z', 'IncT', 'IncQ', 'Inc11', 'Inc13', 'IncN', 'IncFIC', 'FII', 'IncU', 'IncW', 'IncX', 'IncL/M', 'IncR', 'FIA', 'IncFIB', 'Inc18']. For example, if a plasmid only has the Inc group IncA/C, then its Inc encoding is [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0].
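As a rough illustration of the 472-element layout described above, here is a minimal Python sketch. It is not the code from "preprocessing.py". The function names (`pad`, `encode_inc`, `encode_plasmid`), the padding value 0, the truncation of over-long token lists, and the example PC tokens are all assumptions made for illustration; only the segment lengths, the MOB/MPF mapping, and the Inc order come from this reply.

```python
from typing import List, Sequence

PC_LEN = 400        # first 400 elements: PC tokens
MOB_MPF_LEN = 50    # next 50 elements: MOB/MPF tokens

# MOB/MPF token mapping, as listed in this reply.
MOB_MPF_TOKEN = {
    'MOBB': 4, 'MOBQ': 7, 'MOBP': 8, 'MOBM': 5, 'MOBF': 1, 'MOBT': 2,
    'MOBC': 3, 'MOBH': 6, 'MOBV': 9,
    'MPF_G': 10, 'MPF_T': 11, 'MPF_F': 12, 'MPF_I': 13,
}

# Inc one-hot index order, as listed in this reply (22 groups).
INC_ORDER = [
    'IncA/C', 'IncY', 'IncP', 'IncHI1', 'Inc4', 'IncHI2', 'IncI/B/O/K/Z',
    'IncT', 'IncQ', 'Inc11', 'Inc13', 'IncN', 'IncFIC', 'FII', 'IncU',
    'IncW', 'IncX', 'IncL/M', 'IncR', 'FIA', 'IncFIB', 'Inc18',
]


def pad(tokens: Sequence[int], length: int, pad_value: int = 0) -> List[int]:
    """Truncate or right-pad a token list to a fixed length.

    Assumption: 0 is the padding value and extra tokens are dropped.
    """
    tokens = list(tokens)[:length]
    return tokens + [pad_value] * (length - len(tokens))


def encode_inc(inc_groups: Sequence[str]) -> List[int]:
    """Set a 1 at the position of every Inc group present in the plasmid."""
    present = set(inc_groups)
    return [1 if name in present else 0 for name in INC_ORDER]


def encode_plasmid(pc_tokens: Sequence[int],
                   mob_mpf_labels: Sequence[str],
                   inc_groups: Sequence[str]) -> List[int]:
    """Concatenate the three segments into one 472-element vector."""
    mob_mpf_tokens = [MOB_MPF_TOKEN[label] for label in mob_mpf_labels]
    return (pad(pc_tokens, PC_LEN)
            + pad(mob_mpf_tokens, MOB_MPF_LEN)
            + encode_inc(inc_groups))


# Hypothetical plasmid: three made-up PC tokens, MOBF + MPF_T, and IncA/C.
vec = encode_plasmid([17, 42, 5], ['MOBF', 'MPF_T'], ['IncA/C'])
print(len(vec))
```

With this layout, the PC segment is `vec[:400]`, the MOB/MPF segment is `vec[400:450]`, and the Inc segment is `vec[450:]`.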
Feel free to reach out if you have any other questions!
Hello, Author!
Your tool is incredibly effective, and I truly appreciate your work. However, I need your assistance with a specific issue.
In the database section, your model maps the extracted PC, MOB, MPF, and Inc features into encoded formats. I’d like to ask:
How did you construct the mapping dictionary? Would it be possible for you to share the relevant code? I would greatly appreciate your help! Thank you in advance for your time and support.