Request for Guidance on Mapping Dictionary Implementation in Your Tool

Hi there. Thanks for your question and for using our tool!

The code that constructs the encoded features (PC, MOB, MPF, and Inc) is in lines 120 to 232 of the "preprocessing.py" script.

Specifically, the final features for each plasmid are encoded as a vector of length 472: the first 400 elements are PC tokens, the next 50 elements are MOB/MPF tokens, and the last 22 elements are the Inc encoding in one-hot format.
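As a rough illustration of that layout, here is a minimal sketch that splits a 472-element vector back into its three segments. The function name `split_features` and the constant names are hypothetical and are not taken from "preprocessing.py"; only the segment lengths come from the description above.

```python
# Segment lengths as described above: 400 PC + 50 MOB/MPF + 22 Inc = 472.
PC_LEN, MOB_MPF_LEN, INC_LEN = 400, 50, 22
TOTAL_LEN = PC_LEN + MOB_MPF_LEN + INC_LEN


def split_features(vec):
    """Split one plasmid feature vector into (pc, mob_mpf, inc) segments.

    Hypothetical helper for illustration only.
    """
    if len(vec) != TOTAL_LEN:
        raise ValueError(f"expected {TOTAL_LEN} elements, got {len(vec)}")
    pc = vec[:PC_LEN]                               # elements 0..399
    mob_mpf = vec[PC_LEN:PC_LEN + MOB_MPF_LEN]      # elements 400..449
    inc = vec[PC_LEN + MOB_MPF_LEN:]                # elements 450..471
    return pc, mob_mpf, inc


# Usage with a dummy vector whose values equal their own positions.
pc, mob_mpf, inc = split_features(list(range(TOTAL_LEN)))
print(len(pc), len(mob_mpf), len(inc))
```

Slicing like this can be a quick sanity check that a vector you build yourself has the segments in the expected order.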

The PC tokens are determined by the protein cluster index. The mapping for MOB/MPF tokens is: {'MOBB':4, 'MOBQ':7, 'MOBP':8, 'MOBM':5, 'MOBF':1, 'MOBT':2, 'MOBC':3, 'MOBH':6, 'MOBV':9, 'MPF_G':10, 'MPF_T':11, 'MPF_F':12, 'MPF_I':13}

The index order for the Inc one-hot encoding is ['IncA/C', 'IncY', 'IncP', 'IncHI1', 'Inc4', 'IncHI2', 'IncI/B/O/K/Z', 'IncT', 'IncQ', 'Inc11', 'Inc13', 'IncN', 'IncFIC', 'FII', 'IncU', 'IncW', 'IncX', 'IncL/M', 'IncR', 'FIA', 'IncFIB', 'Inc18']. For example, if a plasmid has only the Inc group IncA/C, then its Inc encoding is [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0].
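If it helps, the Inc encoding can be reproduced with a few lines of Python. This is only a sketch, not the code from "preprocessing.py": the function name `encode_inc` is hypothetical, and setting several positions to 1 when a plasmid carries more than one Inc group is my assumption here.

```python
# Index order for the Inc encoding, copied from the list above.
INC_GROUPS = ['IncA/C', 'IncY', 'IncP', 'IncHI1', 'Inc4', 'IncHI2',
              'IncI/B/O/K/Z', 'IncT', 'IncQ', 'Inc11', 'Inc13', 'IncN',
              'IncFIC', 'FII', 'IncU', 'IncW', 'IncX', 'IncL/M', 'IncR',
              'FIA', 'IncFIB', 'Inc18']


def encode_inc(groups):
    """Return a 22-element 0/1 list with a 1 at the index of each Inc group.

    Hypothetical helper; handling of multiple groups is an assumption.
    A name that is not in INC_GROUPS raises ValueError (from list.index).
    """
    vec = [0] * len(INC_GROUPS)
    for group in groups:
        vec[INC_GROUPS.index(group)] = 1
    return vec


# A plasmid that only has Inc group IncA/C gets a 1 in the first position.
print(encode_inc(['IncA/C']))
```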

Feel free to reach out if you have any other questions!

