LiLabAtVT / DeepTE

Neural network classification of TE
BSD 3-Clause "New" or "Revised" License

DeepTE classification differs from that in the curated library #30

Open sahoo-rk opened 8 months ago

sahoo-rk commented 8 months ago

Hello, and thanks for developing this interesting tool. While re-classifying the TE library of an insect species, I observed some discrepancies between the curated/de novo classifications and those from DeepTE. A snapshot of such cases is given below. In Case 1, the library was derived from curated databases (Dfam/SINEBase), so I expected DeepTE to report the same classification. Instead, DeepTE gives either a completely different or a higher-level classification. The situation is similar in Case 2 for de novo predictions. How should I proceed in such instances? Please advise.

Case 1:
sinebase#SINE/Unknown ClassI
dfam#LINE/Unknown ClassI
dfam#LINE/Unknown ClassI_LTR_Gypsy

Case 2:
TE_00003106_INT#LTR/Gypsy ClassI
TE_00004093_INT#LTR/Copia ClassI
TE_00003851_LTR#LTR/Gypsy unknown

NB: DeepTE was run with the supplied metazoan model, and classification of the TE library of 10K sequences completed in only 6 minutes. Best,

songliVT commented 8 months ago

Hi Haidong,

What would you suggest in this case?

Song


songliVT commented 8 months ago

See Haidong's reply (he is the first author of the paper and he moved to a different position):

DeepTE may not classify perfectly, owing to incomplete training data or some degree of overfitting. For example, the training data I originally used may not cover the curated databases you mentioned, so the models may not have learned the patterns in those databases. Would you mind testing how often these cases occur? I would guess that most of the curated classifications are captured. A second option is to use the 'training_example_dir' in the DeepTE GitHub repository to train a new model on the new curated databases, which may help resolve this issue.

Best wishes, Haidong
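
To check how often these cases occur, as suggested above, one option is to tally agreement between the classification embedded in each sequence header and the DeepTE label. The following is only a minimal sketch, not part of DeepTE. It assumes a two-column, whitespace-separated file of `name#Order/Superfamily` and DeepTE label (the layout of the examples in this issue), and the mapping `ORDER_TO_PATH` from header orders to DeepTE-style label tokens is a hypothetical one covering just the orders that appear here; the function names are illustrative as well.

```python
from collections import Counter

# ASSUMPTION: hypothetical mapping from the order in a RepeatMasker-style
# header to DeepTE-style label tokens. Only the orders seen in this issue
# are covered; extend it for your own library.
ORDER_TO_PATH = {
    "LTR": ["ClassI", "LTR"],
    "LINE": ["ClassI", "nLTR", "LINE"],
    "SINE": ["ClassI", "nLTR", "SINE"],
}


def expected_path(seq_name):
    """Turn 'id#Order/Superfamily' into a list of expected label tokens."""
    if "#" not in seq_name:
        return []
    annotation = seq_name.split("#", 1)[1]
    order, _, superfamily = annotation.partition("/")
    path = list(ORDER_TO_PATH.get(order, []))
    # Only LTR superfamilies (e.g. Gypsy, Copia) add a further level here.
    if path and order == "LTR" and superfamily and superfamily != "Unknown":
        path.append(superfamily)
    return path


def compare(seq_name, predicted):
    """Classify one (header, DeepTE label) pair."""
    if predicted.lower() == "unknown":
        return "unclassified"
    expected = expected_path(seq_name)
    if not expected:
        return "unmapped"  # header order missing from ORDER_TO_PATH
    tokens = predicted.split("_")
    if tokens == expected:
        return "agree"
    if tokens == expected[:len(tokens)]:
        return "coarser"  # consistent, but DeepTE stopped at a higher level
    if tokens[:len(expected)] == expected:
        return "finer"  # consistent, and DeepTE is more specific
    return "conflict"


def tally(pairs):
    """Count outcomes over an iterable of (header, DeepTE label) pairs."""
    return Counter(compare(name, label) for name, label in pairs)


def read_pairs(path):
    """Read a two-column, whitespace-separated file (assumed layout)."""
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) == 2:
                yield fields[0], fields[1]


if __name__ == "__main__":
    # The six examples quoted in this issue.
    examples = [
        ("sinebase#SINE/Unknown", "ClassI"),
        ("dfam#LINE/Unknown", "ClassI"),
        ("dfam#LINE/Unknown", "ClassI_LTR_Gypsy"),
        ("TE_00003106_INT#LTR/Gypsy", "ClassI"),
        ("TE_00004093_INT#LTR/Copia", "ClassI"),
        ("TE_00003851_LTR#LTR/Gypsy", "unknown"),
    ]
    for outcome, count in sorted(tally(examples).items()):
        print(outcome, count)
```

Separating "coarser" from "conflict" matters for the cases reported here: a label such as `ClassI` for an `LTR/Gypsy` header is less specific but not contradictory, whereas `ClassI_LTR_Gypsy` for a `LINE/Unknown` header is a genuine disagreement. Running `tally(read_pairs(...))` over the full 10K-sequence library would show which of the two dominates.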

