BioinformaticsLabAtMUN / Promotech

Machine-learning-based general bacterial promoter prediction tool.
GNU General Public License v3.0

Understanding Output #17

Closed. hkforbio closed this issue 10 months ago

hkforbio commented 10 months ago

Thank you for developing and promoting Promotech.

I was able to run Promotech and got the following output:

CREATING OUTPUT FOLDER: results

PRINTING CONTENT

  1. GENOME: AP009180.1 - LENGTH: 159662

    JOINING ALL CHROMS AND SEQS INTO A SINGLE FOR TETRA-NUCLEOTIDE SLIDING WINDOW

    JOINED GENOME: AP009180.1 - LENGTH: 159,662

    GENERATING PROMOTER SEQUENCES WITH WINDOW-SIZE: 40 AND STEP: 1. EXPECTED SAMPLES: 159,621

100% (159621 of 159621) |################| Elapsed Time: 0:00:00 Time: 0:00:00

 TIME ELAPSED FROM START (HOUR:MIN:SEC): 00:00:00

CUTTED 40 NT SEQUENCES GENERATED SUCCESSFULLY. # OF SAMPLES: 159,621 = (159621,).
SAMPLE #1: ATGAATACTATATTTTCAAGAATAACACCATTAGGAAATG
SAMPLE #2: TGAATACTATATTTTCAAGAATAACACCATTAGGAAATGG

CONVERTING 159621 CUTTED 40 NT SEQUENCES TO RF-HOT SEQUENCES USING MAPPING VALUES

[{'A': array([1., 0., 0., 0.])}, {'G': array([0., 1., 0., 0.])}, {'C': array([0., 0., 1., 0.])}, {'T': array([0., 0., 0., 1.])}]

CONVERTING DATA 99% (159075 of 159621) |############### | Elapsed Time: 0:00:08 ETA: 0:00:00

HOT ENCODED SEQUENCES GENERATED SUCCESSFULLY.

   A  G  C  T  A  ...  T  A  G  C  T
0  1  0  0  0  0  ...  1  0  1  0  0
1  0  0  0  1  0  ...  0  0  1  0  0
2  0  1  0  0  1  ...  0  0  0  0  1
3  1  0  0  0  1  ...  1  1  0  0  0
4  1  0  0  0  0  ...  0  0  0  1  0

[5 rows x 160 columns]

 TIME ELAPSED FROM START (HOUR:MIN:SEC): 00:00:10

RF-HOT SEQUENCES GENERATED SUCCESSFULLY. OUTPUT DATAFRAME SHAPE: (159621, 160)

SAMPLE:

   A  G  C  T  A  ...  T  A  G  C  T
0  1  0  0  0  0  ...  1  0  1  0  0
1  0  0  0  1  0  ...  0  0  1  0  0
2  0  1  0  0  1  ...  0  0  0  0  1
3  1  0  0  0  1  ...  1  1  0  0  0
4  1  0  0  0  0  ...  0  0  0  1  0

[5 rows x 160 columns]

SAVING FORWARD STRAND HOT-ENCODED SEQUENCES TO BINARY FILE USING JOBLIB TO: results/RF-HOT.data

 TIME ELAPSED FROM START (HOUR:MIN:SEC): 00:00:10

FILE SAVED SUCCESSFULLY AT: results/RF-HOT.data

GENERATING INVERSE STRAND SEQUENCES.

100% (159621 of 159621) |################| Elapsed Time: 0:00:00 Time: 0:00:00

INVERSE STRAND SEQUENCES GENERATED SUCCESSFULLY. # OF SAMPLES: 159,621. SAMPLE:
ORIGINAL : ATGAATACTATATTTTCAAGAATAACACCATTAGGAAATG
INVERSE : CATTTCCTAATGGTGTTATTCTTGAAAATATAGTATTCAT

 TIME ELAPSED FROM START (HOUR:MIN:SEC): 00:00:10

CONVERTING 159621 INVERSE STRAND 40 NT SEQUENCES TO RF-HOT SEQUENCES USING MAPPING VALUES

[{'A': array([1., 0., 0., 0.])}, {'G': array([0., 1., 0., 0.])}, {'C': array([0., 0., 1., 0.])}, {'T': array([0., 0., 0., 1.])}]

CONVERTING INVERSE DATA 99% (158962 of 159621) |################################################################################################################################################################################## | Elapsed Time: 0:00:08 ETA: 0:00:00

HOT ENCODED SEQUENCES GENERATED SUCCESSFULLY.

   A  G  C  T  A  ...  T  A  G  C  T
0  0  0  1  0  1  ...  0  0  0  0  1
1  0  0  1  0  0  ...  0  1  0  0  0
2  1  0  0  0  0  ...  1  0  0  1  0
3  0  0  0  1  1  ...  1  0  0  0  1
4  0  1  0  0  0  ...  0  0  0  0  1

[5 rows x 160 columns]

 TIME ELAPSED FROM START (HOUR:MIN:SEC): 00:00:20

RF-HOT SEQUENCES GENERATED SUCCESSFULLY. OUTPUT DATAFRAME SHAPE: (159621, 160)

SAMPLE:

   A  G  C  T  A  ...  T  A  G  C  T
0  0  0  1  0  1  ...  0  0  0  0  1
1  0  0  1  0  0  ...  0  1  0  0  0
2  1  0  0  0  0  ...  1  0  0  1  0
3  0  0  0  1  1  ...  1  0  0  0  1
4  0  1  0  0  0  ...  0  0  0  0  1

[5 rows x 160 columns]

SAVING INVERSE STRAND SEQUENCES TO BINARY FILE USING JOBLIB TO: results/RF-HOT-INV.data

 TIME ELAPSED FROM START (HOUR:MIN:SEC): 00:00:20

FILE SAVED SUCCESSFULLY AT: results/RF-HOT-INV.data
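
For anyone reading the log above, the preprocessing it reports can be reproduced on a small scale. The following is a minimal sketch, not Promotech's own code: the function names are hypothetical, and only the window size (40), the step (1), and the A, G, C, T column order come from the log. The "inverse" sample printed in the log matches the reverse complement of the original window, which is what `reverse_complement` computes here. One caveat: a full sliding window over 159,662 nt gives 159,662 - 40 + 1 = 159,623 windows, while the log reports 159,621, so the tool's loop bounds appear to differ slightly from this sketch.

```python
# Column order taken from the mapping printed in the log: A, G, C, T.
ALPHABET = "AGCT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def sliding_windows(genome, size=40, step=1):
    """Yield every full-length window of `size` nt, moving `step` nt at a time."""
    for start in range(0, len(genome) - size + 1, step):
        yield genome[start:start + size]


def reverse_complement(seq):
    """Return the opposite-strand sequence, read 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]


def one_hot(seq):
    """Flatten a sequence into 4 values per nucleotide (40 nt -> 160 values)."""
    return [1 if base == letter else 0 for base in seq for letter in ALPHABET]


first = "ATGAATACTATATTTTCAAGAATAACACCATTAGGAAATG"  # SAMPLE #1 from the log

print(reverse_complement(first))
# -> CATTTCCTAATGGTGTTATTCTTGAAAATATAGTATTCAT (the INVERSE sample in the log)

encoded = one_hot(first)
print(len(encoded), encoded[:5], encoded[-5:])
# -> 160 [1, 0, 0, 0, 0] [1, 0, 1, 0, 0] (row 0 of the forward-strand table)
```

The first and last five values agree with row 0 of the forward-strand table in the log, which suggests that each row of the 160-column DataFrame is one 40-nt window with four columns per position.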

The program itself runs properly, and I was able to locate RF-HOT.data and RF-HOT-INV.data in my results folder. However, both files are binary, so I cannot read them directly. How do I open RF-HOT.data and RF-HOT-INV.data and interpret their contents? Ideally, I would like to get the promoter sequences and their locations, and store the results in a pandas DataFrame for further analysis.
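
Judging from the log, RF-HOT.data and RF-HOT-INV.data seem to hold the one-hot-encoded 40-nt windows of the forward and inverse strands (the model's input features) rather than promoter predictions, so decoding them gives back every window in the genome, not a list of promoters. With that caveat, here is a rough sketch of loading them with `joblib.load` and decoding the rows into a pandas DataFrame. Assumptions not confirmed by the log: the stored object is a DataFrame or array with one 160-value row per window, row i is the window starting at genome offset i, and rows of the inverse file line up with the forward windows they were derived from. `decode_row` and `load_windows` are hypothetical helper names, and mapping an all-zero position to `N` is my own choice.

```python
ALPHABET = "AGCT"  # column order shown in the log's mapping
WINDOW = 40        # window size reported in the log


def decode_row(row):
    """Turn one one-hot row (4 values per nucleotide) back into a string."""
    values = list(row)
    bases = []
    for i in range(0, len(values), len(ALPHABET)):
        chunk = values[i:i + len(ALPHABET)]
        # Assumption: a position with no column set is reported as "N".
        bases.append(ALPHABET[chunk.index(max(chunk))] if max(chunk) > 0 else "N")
    return "".join(bases)


def load_windows(path, strand):
    """Load a joblib file and return a DataFrame of decoded windows.

    Coordinates are 1-based on the forward strand and assume a step of 1.
    """
    import joblib        # third-party; imported here so decode_row works without it
    import pandas as pd  # third-party

    encoded = joblib.load(path)
    # Accept either a DataFrame (as the log suggests) or a plain array.
    rows = encoded.to_numpy() if hasattr(encoded, "to_numpy") else encoded
    records = [
        {"start": i + 1, "end": i + WINDOW, "strand": strand,
         "sequence": decode_row(row)}
        for i, row in enumerate(rows)
    ]
    return pd.DataFrame.from_records(records)


# Small self-check that needs neither joblib nor pandas:
print(decode_row([1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0]))  # -> ATG

# Intended use with the files from the log (paths as printed there):
# forward = load_windows("results/RF-HOT.data", "+")
# inverse = load_windows("results/RF-HOT-INV.data", "-")
```

If the assumptions hold, `forward.head()` should start with the two SAMPLE sequences printed in the log, which is a quick way to check the decoding before relying on the coordinates.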

Thanks in advance!