GonzalezLab / MCHelper

MCHelper: An automatic tool to curate transposable element libraries
GNU General Public License v3.0
27 stars 3 forks source link

Questions about process files #8

Closed SWei2333 closed 1 month ago

SWei2333 commented 4 months ago

Hi, thank you very much for developing this software. I have some questions about its operation that I would like to ask.

My command is "python3 ~/MCHelper/MCHelper.py -r A -t 20 -l Eb_families.fa -o Eb.fasta_output -g Eb.genome.chr.v3.fasta --input_type fasta -b ./mammalia_odb10.hmm -a F "

I have been running for two days, but the result file only contains some files, and a folder “classifiedModule” which has some empty files. Is there any wrong ? Thank you for your help! 18M Apr 21 17:07 candidate_tes.fa 20 Apr 23 15:52 classifiedModule 5.8M Apr 21 17:10 non_redundant_lib.fa 1.9K Apr 21 17:07 sequences_with_problems.txt

classifiedModule: 0 Apr 22 10:15 Chr10:4533629..4534041_LTR.copies.fa 0 Apr 21 23:18 Chr1:32470769..32471133_LTR.copies.fa 0 Apr 22 06:29 Chr14:17413840..17414191_LTR.copies.fa 0 Apr 21 23:07 Chr17:10292218..10292603_LTR.copies.fa 0 Apr 21 17:19 Chr20:31121217..31121594_LTR.copies.fa 0 Apr 22 06:26 Chr2:33735733..33736113_LTR.copies.fa 0 Apr 21 17:19 Chr26:17425524..17425860_LTR.copies.fa 0 Apr 21 17:52 Chr3:126413529..126413871_LTR.copies.fa 0 Apr 22 23:51 Chr4:135925862..135926222_LTR.copies.fa 0 Apr 22 02:06 Chr7:4387625..4388046_LTR.copies.fa 0 Apr 23 07:09 rnd-1_family-265.copies.fa 0 Apr 22 05:44 rnd-1_family-270.copies.fa 0 Apr 21 17:19 rnd-1_family-351.copies.fa 0 Apr 21 17:19 rnd-1_family-505.copies.fa 0 Apr 22 15:58 rnd-4_family-1061.copies.fa 0 Apr 23 14:17 rnd-5_family-1512.copies.fa 0 Apr 23 13:23 rnd-6_family-3870.copies.fa 0 Apr 21 17:19 rnd-6_family-779.copies.fa 0 Apr 23 15:52 scaffold1:18121064..18121635_LTR.copies.fa 0 Apr 22 08:47 scaffold28:5584399..5584726_LTR.copies.fa

simonorozcoarias commented 4 months ago

Hi !! thank you for using MCHelper.

Yes, MCHelper takes a bit long time to extend the consensus sequences, especially in genomes with thousands of copies (like in mammals). So, probably it's better to run the software with more cores, or just wait a little longer.

Best regards,

Simon.

SWei2333 commented 4 months ago

Thank you for your helpful advice! I have successfully finished a manual BEE module, and the output is a file called "extended_cons.fa". I observed that the number of cons has increased compared to the original ones, and there are occurrences of ">rnd-1_family-58_s_1" and ">rnd-1_family-58_s_2". Do these represent subfamilies?

Additionally, after completing the manual BEE module (-r E, -a M), I am planning to proceed with the next module using the manual option (-r M, -a M). Would it be reasonable to proceed in this manner?

Thank you for your help!

SWei2333 commented 4 months ago

When I tried using the Manual Inspection Module, I noticed that the resulting output included several *orf.fa files, such as "rnd-6_family_5610_profiles_found.hmm; rnd-6_family-5610_putative_te.fa; rnd-6_family-5610_putative_te_orf.fa," along with a temp folder containing the cons' db file. However, I couldn't find any information about the module containing a structural checking step in your article. Is it correct to progress through the modules in the order of A, B, C manually?

simonorozcoarias commented 4 months ago

Hi!!

Yes the BEE module also splits some sequences into subfamilies. So, those with s_1 and s_2 are sequences that came from the same original consensus sequence.

Depending on your goal. The most common workflow (at least for the people I've heard of) is to run first all the modules using the automatic mode (-r A -a F). Then, to run the manual inspection module using the output from the automatic curation (the file curated_sequences_NR.fa). The parameter to do that is just -r M.

Following this way you can take advantage of the automatic flow (you can run it in your cluster with a lot of cores), and also to check every consensus sequences manually to remove those that are still not good. In this way, you can create TE reference libraries with a super high quaility in a short period of time.

I hope this info is useful for you.

Best,

Simon.

SWei2333 commented 4 months ago

Dear Simon Orozco-Arias Thank you for your previous advice. I have tested an RM2 consensus FASTA file using the automatic module (-r A -a F). After the process was completed, I attempted to analyze the results and encountered some questions.

  1. There are too many incomplete consensus sequences. Regarding my initial consensus sequences being classified as SINE, and obtaining approximately 121 out of 137 incomplete sequences, if I wish to determine the specific proportions of TE class, it would be advisable to perform a secondary classification? Could you please recommend any methods you suggest?

  2. The level of classification. MCHelper only classifies TEs into order(LTR LINE SINE DNA). If I want to obtain the superfamily, would it be advisable for me to use another classification software such as DeepTE to achieve my goal? Alternatively, does MCHepler have a method to generate the superfamily?

  3. How does MCHepler handle chimeric consensus sequence?

  4. How does MCHepler identify special TEs, like solo-LTR?

I am grateful for any response you can provide, and I wish you a happy and successful life and work!

Best wishes

For the secondary classification of incomplete sequences, I recommend using a method that takes into account the available information and characteristics specific to your data. This could involve utilizing additional features such as sequence length, structural motifs, or any other relevant criteria that can aid in further categorizing the incomplete sequences. Consider consulting domain experts or referring to established classification methods within your research field to ensure the most appropriate approach.

Simon Orozco-Arias @.***> 于2024年4月29日周一 21:38写道:

Hi!!

Yes the BEE module also splits some sequences into subfamilies. So, those with s_1 and s_2 are sequences that came from the same original consensus sequence.

Depending on your goal. The most common workflow (at least for the people I've heard of) is to run first all the modules using the automatic mode (-r A -a F). Then, to run the manual inspection module using the output from the automatic curation (the file curated_sequences_NR.fa). The parameter to do that is just -r M.

Following this way you can take advantage of the automatic flow (you can run it in your cluster with a lot of cores), and also to check every consensus sequences manually to remove those that are still not good. In this way, you can create TE reference libraries with a super high quaility in a short period of time.

I hope this info is useful for you.

Best,

Simon.

— Reply to this email directly, view it on GitHub https://github.com/GonzalezLab/MCHelper/issues/8#issuecomment-2082779346, or unsubscribe https://github.com/notifications/unsubscribe-auth/BIAZCE73QVWHIPDZAHKD5ZTY7ZEOLAVCNFSM6AAAAABGUYB5LWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAOBSG43TSMZUGY . You are receiving this because you authored the thread.Message ID: @.***>

SWei2333 commented 4 months ago

Dear Simon Orozco-Arias

I apologize for interrupting you again, but I would like to add one more question. When I attempted TE classification using other methods, I realized that the selection of a reference library is crucial in homology blast. I noticed that your reference TE library includes MCTE, Flybase, Dfam, and Repbase. My target species is a mammal, the presence of an excessive number of Drosophila TEs in the reference library may potentially impact the alignment results. Do you have any recommendations on how to address this issue? Where should I make modifications to the library specification?

Thank you for your help!

Best wishes

simonorozcoarias commented 4 months ago

Hi !!

Well I will try to answer each question separately.

  1. For MCHelper, the incomplete TEs are those that they don't contain all the structural features you expect, or they contain other structural feature that is not expected at all (for example an endonuclease domain in a LTR element). So, in the case of non-autonomous TEs like SINE, this is difficult to manage because they basically don't have any structural features (like terminal repeats or domains), apart for the polyA tail, that is not always found. That's, in my opinion, why MCHelper is marking your SINE elements as incompletes. Nevertheless, you can check them in order to figure out if they are incomplete or not (I'm not expert in SINE elements, sorry). If you're analyzing a mammal genome, you can use a curated library to look for homolgy, following the 80-80-80 rule. That is a way to classify them into superfamilies as well.

  2. In general, most of the superfamily classifications are made by homology. So, as I mentioned for the SINE elements, you can blast all your sequences against a curated mammal library. Dfam has plenty of good curated mammal consensus sequences. In this case, you can follow again the 80-80-80 rule. You can also use a DL-based method to classify them, but I highly suggest you to combine this analysis with another approach (based on homology on in strucutre), since I saw several cases where the DL based have done incorrect classifications.

  3. MCHelper doesn't handle chimeric sequences. In the output file named "denovoLibTEs_PC.classif" MCHelper puts in the last column if it found evidence for chimeric elements or not. Nevertheless, that evidence is not enough to say that a TE is chimeric or not. In that sense, we really suggest to our users to do a manual inspection of all the consensus sequences, to remove undesirable sequences (like chimerics or satellites).

  4. MCHelper can generate different consensus sequences for the complete LTR element and for the soloLTR, in the case that the later exists in the genome. Nevertheless, the software doesn't change the classification to soloLTR. Then, in this case, it is the user who is responsible for conveniently renaming it.

  5. Yes, you can add more curated sequences from any close mammal species that you have access to. The only thing you need to do is append the sequences to the file named "allDatabases.clustered_rename.fa", which is located in the db folder. Remember delete the files allDatabases.clustered_rename.fa.n* to force MCHelper to take into account the changed version of the file.

I hope I answered all your questions, and thank you again for using MCHelper.

Best,

Simon.

SWei2333 commented 4 months ago

Dear Simon Orozco-Arias

Thank you for your response! I have tested several species using my new 'allDatabases.clustered_rename.fa' file. However, I encountered a strange result where one species had an initial consensus of 730 sequences, but after using the default parameters in MCHelper, both the 'curated_sequences_R.fa' and 'curated_sequences_NR.fa' files only contained two sequences.

There were no error messages during the process. I observed that all 730 sequences experienced the BEE module, but after BLAST, only two remained. In the 'fullLengthFrag.txt' file, except for two sequences, all other entries in the last four columns are '0.' This appears to be the result of false positives filtering out many sequences. Can this be considered reliable, or could there be errors in the RepeatModeler results?

Looking forward to your reply, thank you very much!

There is my progress information below.

MESSAGE: Output folder /home/Mog/Mog.fasta_output created. MESSAGE: Missing -s parameter, using by default: 60 MESSAGE: Missing -e parameter, using by default: 500 MESSAGE: Missing -x parameter, using by default: 16 MESSAGE: Starting with Classified module... WARNING: There are some sequences with problems in your library and MCHelper cannot process them. Please check them in the file: /home/Mog/Mog.fasta_output/sequences_with_problems.txt MESSAGE: Starting with BEE step ...

Total stats : Sequences kept: 2

TEs detected as complete: 1 TEs detected as incomplete: 1 TEs manually analyzed : 0 TEs sent to Unclassified module : 0

MESSAGE: There is no sequences to unclassified Module, skipping ... WARNING: the file /home/Mog/Mog.fasta_output/unclassifiedModule/kept_seqs_unclassified_module.fa was not created. Please check.

Best wishes

simonorozcoarias commented 4 months ago

Hi !!

Mmmm I see. It looks like the full-length filtering is removing all but two sequences. In the default parameters of MCHelper, the consensus sequences are required to have at least 1 full length copy in the genome, that's why MCHelper is removing almost everything. One thing you can do is check the TE+Aid plots in the classifiedModule folder and check if the consensus were over-extended. You can recognize that looking at the two top plots (the divergence and the coverage plots). The over-extended sequences have coverage of zero at the beginning and at the end of the sequences.

If you find that, my suggestion is running the MCHelper again but using less iterations for the BEE process. You can try with 5 or 10 (-x parameter).

Best,

SWei2333 commented 4 months ago

Dear Simon Orozco-Arias In my TE-Aid, there are only graphs for these two sequences, and there isn't the situation you described with both ends having no coverage. Should I look at the TE-Aid for the filtered-out sequences, and how can I access that part?

SWei2333 commented 4 months ago

To try addressing this issue, I conducted some experiments myself. I manually blasted the unextended original consensus sequences against the genome using blastn (10e-8). The results revealed that there are indeed numerous consensus sequences that cannot be found in the genome with hits reaching 94% of the consensus length before extension. However, some consensus sequences do have such hits.

simonorozcoarias commented 4 months ago

Oh you're right!! the plots are only generated to those that passed all the filters. Sorry, it was my bad. You can run MCHelper again changing the minimum number of full length copies to 0 (-c 0 parameter). This way you shouldn't lose the TEs that have not any full length copies.

After running MCHelper this way, you can find the TE+Aid for them and check if there is any over-extension problem.

Best,

SWei2333 commented 4 months ago

Thank you for your advice, I'll try! In addition, I would like to ask you another question. I don't fully understand why a consensus without more than one FLF copy in the genome is considered a false positive. When I curated manually before, I only considered it reasonable if there were more than 10 hits aligned in the genome.

simonorozcoarias commented 4 months ago

This is based in the Jamilloux et al. paper (https://doi.org/10.1109/JPROC.2016.2590833). They show that considering only those consensus sequences that have at least one full length copy in the genome, you can reduce a lot the manual curation needed for having a good library. It also makes sense in a way that you would expect that every consensus sequence you have in the library is the representation of a group of TE copies (called a TE family). Actually, we have observed that most of the consensus sequences without any full-length copies are indeed not real TEs, for examples chimeras.

Nevertheless, this can change from species to species, and also depending on the methodology you used for the de novo detection of TEs.

Best,

SWei2333 commented 3 months ago

Mo2.fasta_output.zip https://drive.google.com/file/d/1DjoA4nvGpMD3JwqaCt4ex-zurAdAjgtV/view?usp=drive_web Dear Simon Orozco-Arias

I randomly selected 55 consensuses from this species for testing (-x 5 -c 1). I found that 28 out of the 55 were retained. Among them, I noticed something very peculiar about one consensus (ltr-1_family-11). Looking at the lengths in fullLengthFrag.txt, it went from the original 476bp to 453bp, indicating that it wasn't extended multiple times but rather trimmed shorter. Theoretically, it should have been complete. What's strange is that during my initial attempt, this sequence was also 453bp after extension, but it wasn't retained. This time, with just a change in iteration count, it was retained at 453bp, which has left me puzzled.

Furthermore, after reducing the iteration count, I find it challenging to determine from the TE-Aid graph whether it has been fully extended. What level of coverage at the ends of the sequence or how much length of sequence coverage needs to remain low for it to be considered complete?

Finally, if I want to curate many species using MCHepler, modifying the iteration count based on different species would greatly increase the workload. However, if I reduce the iteration count due to concerns about over-extension, I worry that many sequences may be incomplete (although I believe the results will still be better than the original results from RM2). If I choose to stick with a uniform 16 iterations and change the false positive filtering criterion (-c) to 0, it seems like it could introduce significant errors and increase the workload. Given this situation, do you have any suggestions?

I would greatly appreciate any advice you're willing to offer. Wishing you a happy life!

I have attached the results file from my 5 iteration counts. I would greatly appreciate your guidance whenever you have the time. Thank you very much!

Best wishes

Simon Orozco-Arias @.***> 于2024年5月9日周四 21:02写道:

This is based in the Jamilloux et al. paper ( https://doi.org/10.1109/JPROC.2016.2590833). They show that considering only those consensus sequences that have at least one full length copy in the genome, you can reduce a lot the manual curation needed for having a good library. It also makes sense in a way that you would expect that every consensus sequence you have in the library is the representation of a group of TE copies (called a TE family). Actually, we have observed that most of the consensus sequences without any full-length copies are indeed not real TEs, for examples chimeras.

Nevertheless, this can change from species to species, and also depending on the methodology you used for the de novo detection of TEs.

Best,

— Reply to this email directly, view it on GitHub https://github.com/GonzalezLab/MCHelper/issues/8#issuecomment-2102617605, or unsubscribe https://github.com/notifications/unsubscribe-auth/BIAZCE6CPYAVV7REEIUVMYLZBNXWJAVCNFSM6AAAAABGUYB5LWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMBSGYYTONRQGU . You are receiving this because you authored the thread.Message ID: @.***>

SWei2333 commented 3 months ago

Dear Simon

I iterated through the selected consensus sequences for 5 rounds and 16 rounds respectively. In both attempts, 35 out of 56 consensus sequences were retained. This raised my suspicion that there might have been some issues during the blastn alignment in the software, causing the subsequent consensus sequences not to align with the whole genome, resulting in a direct output of 0, which was then filtered out as false positives. However, I observed no errors during the runtime process. I am now re-aligning all consensus sequences for this species, setting the iteration to 10 rounds, to see if blastn can still function properly.

Best wishes

wei shen @.***> 于2024年5月11日周六 11:28写道:

Mo2.fasta_output.zip https://drive.google.com/file/d/1DjoA4nvGpMD3JwqaCt4ex-zurAdAjgtV/view?usp=drive_web Dear Simon Orozco-Arias

I randomly selected 55 consensuses from this species for testing (-x 5 -c 1). I found that 28 out of the 55 were retained. Among them, I noticed something very peculiar about one consensus (ltr-1_family-11). Looking at the lengths in fullLengthFrag.txt, it went from the original 476bp to 453bp, indicating that it wasn't extended multiple times but rather trimmed shorter. Theoretically, it should have been complete. What's strange is that during my initial attempt, this sequence was also 453bp after extension, but it wasn't retained. This time, with just a change in iteration count, it was retained at 453bp, which has left me puzzled.

Furthermore, after reducing the iteration count, I find it challenging to determine from the TE-Aid graph whether it has been fully extended. What level of coverage at the ends of the sequence or how much length of sequence coverage needs to remain low for it to be considered complete?

Finally, if I want to curate many species using MCHepler, modifying the iteration count based on different species would greatly increase the workload. However, if I reduce the iteration count due to concerns about over-extension, I worry that many sequences may be incomplete (although I believe the results will still be better than the original results from RM2). If I choose to stick with a uniform 16 iterations and change the false positive filtering criterion (-c) to 0, it seems like it could introduce significant errors and increase the workload. Given this situation, do you have any suggestions?

I would greatly appreciate any advice you're willing to offer. Wishing you a happy life!

I have attached the results file from my 5 iteration counts. I would greatly appreciate your guidance whenever you have the time. Thank you very much!

Best wishes

Simon Orozco-Arias @.***> 于2024年5月9日周四 21:02写道:

This is based in the Jamilloux et al. paper ( https://doi.org/10.1109/JPROC.2016.2590833). They show that considering only those consensus sequences that have at least one full length copy in the genome, you can reduce a lot the manual curation needed for having a good library. It also makes sense in a way that you would expect that every consensus sequence you have in the library is the representation of a group of TE copies (called a TE family). Actually, we have observed that most of the consensus sequences without any full-length copies are indeed not real TEs, for examples chimeras.

Nevertheless, this can change from species to species, and also depending on the methodology you used for the de novo detection of TEs.

Best,

— Reply to this email directly, view it on GitHub https://github.com/GonzalezLab/MCHelper/issues/8#issuecomment-2102617605, or unsubscribe https://github.com/notifications/unsubscribe-auth/BIAZCE6CPYAVV7REEIUVMYLZBNXWJAVCNFSM6AAAAABGUYB5LWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMBSGYYTONRQGU . You are receiving this because you authored the thread.Message ID: @.***>

Jay0xd commented 3 months ago

I have a problem with like that. Does someone have any issue

FileNotFoundError: [Errno 2] No such file or directory: 'CIAlign'

SWei2333 commented 3 months ago

I have a problem with like that. Does someone have any issue

FileNotFoundError: [Errno 2] No such file or directory: 'CIAlign'

sorry, I didn't have this error

simonorozcoarias commented 3 months ago

CIAlign

That error can appears when you have troubles installing the CIAlign program. To test that, activate the MCHelper environment and then try to run CIAlign in the terminal.

Best,

simonorozcoarias commented 3 months ago

Dear Swei,

If you want to curate a lot of species, you can try first changing the number of iteration for a few species (I think you are already doing this). Then , try to find which is the best (and faster) way to generate curated libraries for all of them.

Another thing you can try is to remove the blast index created for your genome. They are those files named as genome.fasta.nhr, genome.fasta.nin and so on. After remove them, try to run MCHelper again. Maybe this is just a problem of the blast index created.

I hope this is helping you!

Best,

SWei2333 commented 3 months ago

Dear Simon Thank you for your help. I have a deeper understanding of the software now, and I obtained a result where most consensus sequences have their superfamilies, except for SINEs. The classification of superfamilies seems limited(LINE:CR1、L1、L2;LTR:ERV、LARD、TRIM;TIR:HAT、P). I want to know if the presence of superfamilies is a result of changes I made to the library and if the classification is correct.

Wishing you a happy life!

Simon Orozco-Arias @.***> 于2024年5月16日周四 22:54写道:

Dear Swei,

If you want to curate a lot of species, you can try first changing the number of iteration for a few species (I think you are already doing this). Then , try to find which is the best (and faster) way to generate curated libraries for all of them.

Another thing you can try is to remove the blast index created for your genome. They are those files named as genome.fasta.nhr, genome.fasta.nin and so on. After remove them, try to run MCHelper again. Maybe this is just a problem of the blast index created.

I hope this is helping you!

Best,

— Reply to this email directly, view it on GitHub https://github.com/GonzalezLab/MCHelper/issues/8#issuecomment-2115469278, or unsubscribe https://github.com/notifications/unsubscribe-auth/BIAZCE3U32ZIZVH5HCJP2GDZCTCDNAVCNFSM6AAAAABGUYB5LWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMJVGQ3DSMRXHA . You are receiving this because you authored the thread.Message ID: @.***>

simonorozcoarias commented 3 months ago

Hi Swei,

In the current version of MCHelper we do not have the superfamilies for SINEs. So, unfortunately, they have to be added manually. MCHelper uses an extended version of the superfamily nomenclature proposed by Wicker et al. (2007; https://doi.org/10.1038/nrg2165). So, the superfamilies that MCHelper works with are:

Class I: Retrotransposons Order: LTR; superfamilies: Copia, Gypsy, Bel-Pao, ERV, TRIM, LARD Order: LINE; superfamilies: CR1, R1, R2, RTE, JOCKEY, L1, L2, LOA, I, Order: SINE Order: PLE Order: DIRS; superfamilies: DIRS, Ngaro, VIPER Class II: DNA Transposons Order: TIR; superfamilies: Tc1-Mariner, hAT, Mutator, Merlin, Transib, P, PiggyBac, PIF-Habinger, CACTA, Mule, CMC Order: MITE Order: Crypton Order: Helitron Order: Maverick Unclassified/Unknown

Thank you again for using MCHelper !!!

Best,

Simon.