Ramkyeri commented 1 year ago

@mhemberg

Hi, did you get results by running these commands?

singularity exec docker://dfam/tetools:latest BuildDatabase -name sequence_index -engine ncbi sequence.fasta
singularity exec docker://dfam/tetools:latest RepeatModeler -database sequence_index -pa 32 -LTRStruct > run.out
singularity exec docker://dfam/tetools:latest RepeatMasker -pa 32 -a -s -gff -no_is -lib metazoa sequence.fasta

I am also using the same environment. Please send me your suggestions.

with regards

Ramky

_Originally posted by @Ramkyeri in https://github.com/DerKevinRiehl/transposon_annotation_tools/issues/3#issuecomment-1503368850_

Ramkyeri commented 1 year ago

Dear Kevin, I hope that you are doing well.

As per your suggestion, I am running both RepeatModeler and RepeatMasker in a Docker container.

RepeatMasker ran successfully (RepeatMasker -pa 10 demo.fasta).

But RepeatModeler shows an error. I did not understand this error.

My computer has 16 cores. I tried values from 1 to 16 many times, and it shows the same error.

I am using WSL2 on Windows 11.

I am also wondering about the species name in RepeatMasker. I am working on Phyllostachys edulis.

If this species is not available, what should I do?

Please send me your suggestion. Many thanks for your kindness.

with regards

DerKevinRiehl commented 1 year ago

Hi Ramkyeri, how about you follow the error message and write "-threads" instead of "-pa"?

RepeatModeler -engine ncbi -threads 4 -database demo_index

How about you work on a Linux machine instead of Windows? As a bioinformatician you will face many issues and limitations working with Windows.

If you run RepeatMasker with a species that is unknown to RepeatMasker, you have no other choice than taking the one available which is genetically / biologically closest to your species of interest. As they suggest, you can find a list of supported species here repeatmasker/RepeatMasker/Libraries/taxonomy.dat

Hope this could help a little, wish you success with your endeavour :-)

Best, Kevin

Ramkyeri commented 1 year ago

Dear Kevin,

Many thanks for your kind reply.
RepeatModeler -engine ncbi -threads 4 -database demo_index ran successfully.

In repeatmasker/RepeatMasker/Libraries/taxonomy.dat, I found the following entries for ID 38705:

38705 phyllostachys heterocycla
38705 mhousou
38705 moso bamboo
38705 phyllostachys heterocycla f. pubescens
38705 phyllostachys edulis
38705 mosochiku
38705 kikko-chiku
38705 phyllostachys pubescens var. heterocycla
38705 bambusa edulis phyllostachys edulis

These are all the same species.

But it shows "Species "name" is not known to RepeatMasker".

But when I run with -species rice or arabidopsis, it runs successfully.

Also, in the file "FinalAnnotations_Transposons.gff3", all transposons have unique IDs, but I did not see the nested repeat IDs. I mean the insertion IDs or duplicate copy IDs.

Nested repeats are indicated by the ID column in the RepeatMasker GFF file.

If I get IDs like that, I can find where the same copies are expressed and where they are not expressed.

I hope that you understand what I mean.

Please send me your suggestion. Many thanks for your kindness.

with regards

Ramkyeri commented 1 year ago

Dear Kevin,

RepeatMasker also runs in Ubuntu.

conda config --add channels bioconda
conda config --add channels conda-forge
conda install -c bioconda repeatmasker

RepeatModeler also runs in Ubuntu, but there it did not produce these two files: families.fa and families.stk.

They are produced in the Docker container.

When I run reasonaTE -mode checkAnnotations -projectFolder workspace -projectName testProject with the files generated in the Linux environment, it does not complete, whereas it completes with the files generated in the Docker container.

So these two files are also important.

However, I am still not able to find the closest species to moso bamboo. In the file

"repeatmasker/RepeatMasker/Libraries/taxonomy.dat" we can find only the NCBI taxonomy ID.

Please send me your suggestion.

Thank you for your kindness.

with regards

Ramkyeri commented 1 year ago

Dear Kevin, I do not understand this. Could you give an example for moso bamboo showing how to use it?

`famdb.py names -h
usage: famdb.py names [-h] [-f ] term [term ...]

List the names and taxonomy identifiers of a clade.

positional arguments:
  term  search term. Can be an NCBI taxonomy identifier or part of a scientific or common name

optional arguments:
  -h, --help  show this help message and exit
  -f , --format  choose output format. The default is 'pretty'. 'json' is more appropriate for scripts.`

with regards

DerKevinRiehl commented 1 year ago

Dear Ramkyeri, happy to hear you make progress.

Sorry, but as I am not the developer of RepeatMasker / RepeatModeler, I recommend you to contact them: https://www.repeatmasker.org/ https://www.repeatmasker.org/RepeatModeler/

Hope this helps and that they can help you, Best, Kevin

Ramkyeri commented 1 year ago

Dear Kevin, Many thanks for your kindness.

RepeatModeler is a de novo transposable element family identification tool, but it does not produce a GFF file.
RepeatMasker is run either with a query species (using the “-species” option) or with a custom library (using the “-lib” option).
To generate a custom library for the -lib option, we can use the output file families.fa generated by RepeatModeler.
My question is: will your pipeline TransposonUltimate use the file “-families.fa” in the final annotation? I think the pipeline uses the file “-families.fa”, but I am not sure.

If TransposonUltimate uses the file families.fa to generate the final results, that will be good.

Moso bamboo is closer to rice and maize. Thus, I am planning to run RepeatMasker with both rice and maize, and then combine the results before running this command: reasonaTE -mode parseAnnotations -projectFolder workspace -projectName testProject

But I do not know which files need to be combined, or whether all output files of RepeatMasker need to be combined. Thank you in advance for your kindness, please give your suggestion.

with regards Ramky

DerKevinRiehl commented 1 year ago

Dear Ramky, great to see you make progress.

To your question: TransposonUltimate uses "sequence_index-families.stk" from RepeatModeler and "sequence.fasta.out" from RepeatMasker.

Best, Kevin

Ramkyeri commented 1 year ago

Dear Kevin, Many thanks for your kind motivation and quick reply.

I got it. TransposonUltimate (reasonaTE -mode parseAnnotations -projectFolder workspace -projectName testProject) uses the index files of all the tools, not the FASTA file.

But the index file 'sequence_index-families.stk' might be based on families.fa.

Is that correct? If so, I can run the pipeline even without RepeatMasker.

Or, even if I include RepeatMasker, I may get only a little additional information.

Thank you in advance for your kindness, please give your suggestion.

with regards Ramky

DerKevinRiehl commented 1 year ago

Dear Ramky, that is correct.

You can run the step "parseAnnotations" and reasonaTE will check which tools you ran before. There is no need to run all tools beforehand. So if you only run "repeatmasker" or only "repeatmodeler" or both, reasonaTE will be able to handle it.

Yes, I think families.fa and the sequence_index-families.stk are related to each other, so I am positive that the information will be included into the workflow.

Best regards, Kevin

Ramkyeri commented 1 year ago

Dear Kevin,
Many thanks for your kind reply.
I will update you on the results. with regards Ramky

Ramkyeri commented 1 year ago

Dear Kevin, in reasonaTE, Step 2) Annotate genome with annotation tools, I followed the recommended Option 2.

While running this command: reasonaTE -mode annotate -projectFolder workspace -projectName testProject -tool sinescan

I got the following message:

Invalid command line Unknown option in

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Could not read file '/mnt/f/TransposonUltimate/Ramky/workspace/testProject/sinescan/result/RG/1/1.sine.extendseq.msa.fasta': No such file or directory
STACK: Error::throw
STACK: Bio::Root::Root::throw /home/ramkyas20/transposon_annotation_tools_env/lib/site_perl/5.26.2/Bio/Root/Root.pm:447
STACK: Bio::Root::IO::_initialize_io /home/ramkyas20/transposon_annotation_tools_env/lib/site_perl/5.26.2/Bio/Root/IO.pm:268
STACK: Bio::AlignIO::_initialize /home/ramkyas20/transposon_annotation_tools_env/lib/site_perl/5.26.2/Bio/AlignIO.pm:401
STACK: Bio::AlignIO::new /home/ramkyas20/transposon_annotation_tools_env/lib/site_perl/5.26.2/Bio/AlignIO.pm:311
STACK: Bio::AlignIO::new /home/ramkyas20/transposon_annotation_tools_env/lib/site_perl/5.26.2/Bio/AlignIO.pm:332
STACK: main::Similarity /home/ramkyas20/transposon_annotation_tools_env/bin/SINE_Scan-v1.1.1//PL_pipeline/RG_boundary.pl:205
STACK: /home/ramkyas20/transposon_annotation_tools_env/bin/SINE_Scan-v1.1.1//PL_pipeline/RG_boundary.pl:22

Could you give some suggestion about what this is?

But at the end it prints:

Your genomic dataset has no reasonable SINE candidates
Annotation by software sinescan finished successfully...

The demo file does not have SINE candidates. That is understood.

with regards

Ramkyeri

DerKevinRiehl commented 1 year ago

Hi Ramky, to me it sounds that sinescan is not installed on your environment properly.

In which conda environment are you operating? reasonaTE environment or transposon_annotation_tool_environment?

Can you just type "sinescan" in your console and see what happens?

sinescan

Does it say command not found?

Under the hood, reasonaTE actually calls sinescan like that... (adjusted to your folder paths)

sinescan -s 123 -g /mnt/f/TransposonUltimate/Ramky/workspace/testProject/sequence.fasta -o /mnt/f/TransposonUltimate/Ramky/workspace/testProject/sinescan/output -d /mnt/f/TransposonUltimate/Ramky/workspace/testProject/sinescan/result -z /mnt/f/TransposonUltimate/Ramky/workspace/testProject/sinescan/final

Best, Kevin

Ramkyeri commented 1 year ago

Dear Kevin, Many thanks for your kind reply.
It is in transposon_annotation_tools_env sinescan

Also, I am now running the program with the moso bamboo genome.

The original file name is "Phyllostachys_edulis_V2.Hic.genome". I felt that it was not in FASTA format, so I converted it using this command: awk '{ printf ">%s\n%s\n",$1,$2 }' Phyllostachys_edulis_V2.Hic.genome > sequence.fasta

Is it correct?
I also changed the file name to sequence.fasta.

Now I am at the first step: reasonaTE -mode createProject -projectFolder workspace -projectName testProject -inputFasta sequence.fasta

But it has been running for more than 1 hour to create the project.

Thank you in advance for your kindness, please give your suggestion.

with regards Ramkyeri

DerKevinRiehl commented 1 year ago

Question: did you delete the folder "testProject" from your previous attempt?

Question: how big is "sequence.fasta" and could you show me like the first 10 lines of the file?

Thanks, Best, Kevin

Ramkyeri commented 1 year ago

Dear Kevin, many thanks

Yes, I deleted the testproject, but there are two testProject in two different subfolders in folder TransposonUltimate.

One folder is for demo file, second folder is Moso bamboo.

the genome size is about 2034 Mb.

file name sequence.fasta

file name Phyllostachys_edulis_V2.Hic.genome

Thank you in advance for your kindness, please give your suggestion

with regards Ramkyeri

DerKevinRiehl commented 1 year ago

Dear Ramky, I just checked the code that creates the project.

QQ1: Is the folder structure already created?

QQ2: Do you see in your project folder a file called "sequence.fasta" created?
QQ3: Do you see in your project folder a file called "sequence_heads.txt" created?
QQ4: Do you see in your project folder a file called "sequence_rc.fasta" created?

Maybe it just takes some time due to the size of your genome (but actually it's just 2 GB, so not too big). You should see that these three files are generated, and at the end sequence.fasta and sequence_rc.fasta should have around the same size as your original FASTA file.
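
A short Python sketch of this check, for anyone who wants to automate it. It only tests whether the three files named above exist and whether the two FASTA copies are close in size to the original input. The function name `check_project_files` and the 10 % tolerance are assumptions made for illustration; they are not part of reasonaTE.

```python
import os
import tempfile

# File names taken from the questions above.
EXPECTED = ("sequence.fasta", "sequence_heads.txt", "sequence_rc.fasta")


def check_project_files(project_dir, original_fasta, tolerance=0.10):
    """Return a dict mapping each expected file to 'ok', 'missing' or 'size mismatch'.

    The tolerance (relative size difference to the original FASTA) is an
    arbitrary value chosen for this sketch.
    """
    original_size = os.path.getsize(original_fasta)
    report = {}
    for name in EXPECTED:
        path = os.path.join(project_dir, name)
        if not os.path.isfile(path):
            report[name] = "missing"
            continue
        if name.endswith(".fasta"):
            size = os.path.getsize(path)
            diff = abs(size - original_size) / max(original_size, 1)
            report[name] = "ok" if diff <= tolerance else "size mismatch"
        else:
            report[name] = "ok"
    return report


# Tiny demonstration with temporary files (contents are made up).
with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "input.fasta")
    with open(original, "w") as fh:
        fh.write(">scaffold_1\n" + "ACGT" * 100 + "\n")
    project = os.path.join(tmp, "testProject")
    os.makedirs(project)
    with open(os.path.join(project, "sequence.fasta"), "w") as fh:
        fh.write(">seq1\n" + "ACGT" * 100 + "\n")
    demo_report = check_project_files(project, original)
    print(demo_report)
```

A "size mismatch" for sequence.fasta or sequence_rc.fasta is a hint that the input file was not parsed as expected and is worth inspecting by hand.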

Do you have enough disk space on your machine?

Best, Kevin

Ramkyeri commented 1 year ago

Dear Kevin, Many thanks for your valuable reply. The files are created, but their size is not the same as the original FASTA file. Both files are around 300 MB,

but sequence_heads.txt is 2.09 GB in size.

My machine has 128 GB RAM and it has enough disk space.

Thank you in advance for your kindness, please give your suggestion

with regards Ramkyeri

Ramkyeri commented 1 year ago

Dear Kevin,
I am sorry, the file sequence.fasta is not in the correct format.
Each line starts with >.

Because of this, sequence_heads.txt is 2.09 GB in size. Each line is treated as a header.
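
A small Python sketch of how this kind of problem can be spotted before creating a project. It counts header and sequence lines and flags a file in which every line is a header. All helper names are hypothetical. `awk_like` only mimics the awk command quoted earlier in this thread, under the assumption that the input was already in FASTA format (the later comments suggest this, but it was not confirmed).

```python
def fasta_line_stats(lines):
    """Count header lines (starting with '>') and non-empty sequence lines."""
    headers = 0
    sequence_lines = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            headers += 1
        else:
            sequence_lines += 1
    return headers, sequence_lines


def looks_like_valid_fasta(lines):
    """A plausible FASTA has at least one header and at least one
    sequence line per header."""
    headers, sequence_lines = fasta_line_stats(lines)
    return headers > 0 and sequence_lines >= headers


def awk_like(lines):
    # Mimics: awk '{ printf ">%s\n%s\n",$1,$2 }'
    # For each input line, print ">" plus the first whitespace-separated
    # field, then the second field (empty if the line has only one field).
    out = []
    for line in lines:
        fields = line.split()
        first = fields[0] if len(fields) > 0 else ""
        second = fields[1] if len(fields) > 1 else ""
        out.append(">" + first)
        out.append(second)
    return out


# Assumed input: a file that is already FASTA (sequence content is made up).
already_fasta = [">moso_draft_hic_scaffold_1", "ACGT", "GGCC"]
converted = awk_like(already_fasta)

print(looks_like_valid_fasta(already_fasta))
print(fasta_line_stats(converted))
print(looks_like_valid_fasta(converted))
```

Under that assumption, every non-empty line of the converted file starts with `>`, which matches the symptom described above.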

with regards Ramkyeri

Ramkyeri commented 1 year ago

Dear Kevin,
Sorry for the inconvenience. At first, on Linux, I converted the genome file into a FASTA file in which each line starts with >.

Therefore, I copied the sequences from the genome file into your original sequence.fasta file. I then compared both files with a text comparison tool, and they were similar.

Now it created the project within 20 minutes.
sequence.fasta is created (1.8 GB)
sequence_heads.txt is created
sequence_rc.fasta is created (1.8 GB)
This is similar to the original file size (1.79 GB).

But I am not sure about this one: "QQ: Is folder structure already created?"

Thank you in advance for your kindness, please give your suggestion

with regards Ramkyeri

Ramkyeri commented 1 year ago

Dear Kevin, I also noticed that the header name >moso_draft_hic_scaffold_1 has been changed to >seq1.

Thank you in advance for your kindness, please give your suggestion

with regards Ramkyeri

Ramkyeri commented 1 year ago

Dear Kevin,

I think long path names can be a problem, or we should not keep any files on the same folder.

Now I am running reasonaTE -mode annotate -projectFolder workspace -projectName testProject -tool helitronScanner.

This message appeared while it was running, but it completed successfully.

Thank you

with regards Ramkyeri

Ramkyeri commented 1 year ago

Dear Kevin,
These two ran successfully:
reasonaTE -mode annotate -projectFolder workspace -projectName testProject -tool helitronScanner
reasonaTE -mode annotate -projectFolder workspace -projectName testProject -tool ltrHarvest

helitronScanner took four hours, but ltrHarvest took only 30 minutes.

Do you have any idea which step will take the most time, possibly days?

with regards

Ramkyeri

DerKevinRiehl commented 1 year ago

Dear Ramky, it is pretty normal to wait for several days or even weeks (around 2 weeks). How long each step takes really depends on your machine / cluster and genome. You cannot expect to run things on your laptop and have them finish in one hour; this consumes quite a lot of resources.

For the sequence renaming: that is pretty normal, and the file sequence_heads.txt will tell you how it was renamed to work internally.
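
To translate the internal names back for reports, a lookup table can be sketched in Python directly from the original FASTA headers. This assumes that sequences are numbered `seq1`, `seq2`, ... in file order, which fits the `>moso_draft_hic_scaffold_1` to `>seq1` example reported above but is otherwise an assumption; sequence_heads.txt remains the authoritative record. The function name `build_rename_map` is hypothetical.

```python
def build_rename_map(fasta_lines):
    """Map internal names (seq1, seq2, ...) to original header names.

    Assumption: numbering follows the order of headers in the input file.
    """
    mapping = {}
    index = 0
    for line in fasta_lines:
        if line.startswith(">"):
            index += 1
            mapping["seq%d" % index] = line[1:].strip()
    return mapping


# Example input; the second header name and the sequences are made up.
example = [
    ">moso_draft_hic_scaffold_1",
    "ACGT",
    ">moso_draft_hic_scaffold_2",
    "GGCC",
]
rename_map = build_rename_map(example)
print(rename_map["seq1"])
```

It is worth comparing a few entries of such a table against sequence_heads.txt before relying on it.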

Best, Kevin

Ramkyeri commented 1 year ago

Dear Kevin, Many thanks for your kind reply and suggestions. I will update you.

with regards Ramkyeri

Ramkyeri commented 1 year ago

Dear Kevin, I hope that you are doing well. So far, only four tools have completed.

Unfortunately, while running the command reasonaTE -mode annotate -projectFolder workspace -projectName testProject -tool must, the system had an issue with monitor flickering, so I restarted the computer.

Now I wonder how to continue the analysis.

Shall I delete the output files in the must folder? (I think there is only one folder for must.)
Or shall I continue without deleting the output files? If I do that, will it replace the files?

Also, I do not have money to use an online server. I would like to collaborate with someone who can offer me the use of their server for mutual benefit.

Thank you in advance for your kindness, please give your suggestion

with regards Ramkyeri

Ramkyeri commented 1 year ago

Dear Kevin,

In the must folder, I compared the output files. In my analysis, there are a lot of files (but this analysis is not yet completed).

In your demo folder, there is only one file called "Result".

According to my understanding, must first splits the genome into 10 groups, then analyses them and combines the results. Then it deletes the unnecessary files and keeps only the result file.

Am I correct?

Thank you in advance for your kindness, please give your suggestion

with regards Ramkyeri

DerKevinRiehl commented 1 year ago

Dear Ramkyeri, I recommend you clean the must folder completely, and then run the tool for must specifically again.

If you want to collaborate with someone, you could "split up the work" by running single tools.

Best, Kevin

DerKevinRiehl commented 1 year ago

After my tool has finished running must, it deletes the temporary files and keeps only some result files. Do not interpret or try to create files yourself. Just let the tool run by itself.

Best, Kevin

Ramkyeri commented 1 year ago

Dear Kevin, Many thanks for your kind reply and suggestions. I will update you. Could you recommend someone for collaboration? I would also be very happy to collaborate with you, but I wonder whether you have time.

with regards Ramkyeri

DerKevinRiehl commented 1 year ago

I wrote you an email, did you receive it? Best, Kevin

Ramkyeri commented 1 year ago

Dear Kevin, Many thanks for your mail.

with regards Ramkyeri

Ramkyeri commented 1 year ago

Dear Kevin, Greetings! Unfortunately, the system again had an issue with monitor flickering, so I restarted the computer.

I am using this one: "Option 2: annotate with one specific tool (good for parallelization or rerunning, recommended). It is mandatory to run the protein annotation tools transposonPSI and NCBICDD1000 for the next steps."

I am thinking of running sinefind instead of must.

Shall I run all the tools in order, or in random order?

I did not understand this one: "It is mandatory to run the protein annotation tools transposonPSI and NCBICDD1000 for the next steps."

Do you mean that these two tools are required for the next step (reasonaTE -mode pipeline -projectFolder workspace -projectName testProject)?

Thank you in advance for your kindness, please give your suggestion

with regards Ramkyeri

DerKevinRiehl commented 1 year ago

Dear Ramky, great to hear about your progress.

Yes, random order is fine.

The two tools "transposonPSI" and "NCBICDD1000" need to be run; all other tools do not necessarily need to be run.

Best and good luck, Kevin

Ramkyeri commented 1 year ago

Dear Kevin, Many thanks for your kind reply.

with regards Ramkyeri

Ramkyeri commented 1 year ago

Dear Kevin, now "sinefind" is running. I hope that the following output in the terminal is not an error message.

with regards Ramkyeri

DerKevinRiehl commented 1 year ago

No, no, it's fine :-)

Ramkyeri commented 1 year ago

Dear Kevin, Many thanks for your kind reply.

The tool "sinefind" completed successfully within 3 hours. However, I noticed the following error message.

I also checked the result files in the output folders; they were only 8 KB and 9 KB.

I think that sinefind did not process all sequences, seq2 to seq19684.

Thank you in advance for your kindness, please give your suggestion

with regards Ramkyeri

DerKevinRiehl commented 1 year ago

Hi Ramky, sorry to hear that, but we are not the authors of sinefind, we just wrapped it into a conda package. Maybe the genome is just too big for this kind of tool?

Best, Kevin

Ramkyeri commented 1 year ago

Dear Kevin, Many thanks for your reply. I understand. I will check the original source and update you.

with regards Ramkyeri

Ramkyeri commented 1 year ago

Dear Kevin, SineFinder was published in Plant Cell https://doi.org/10.1105/tpc.111.088682 in 2010. Since this tool is 13 years old, I could not find any updates for it. So I think I can skip this tool. However, RepeatModeler and RepeatMasker can cover SINE elements. Am I correct? I kindly request you to give your suggestion.

I am also sorry to share this: the author of SineFinder, Prof. Thomas Schmidt, passed away on 1st August 2019.

with regards Ramkyeri

DerKevinRiehl commented 1 year ago

Dear Ramky, please read our paper; you can find answers to almost all the questions you asked before there. https://academic.oup.com/nar/article/50/11/e64/6541023

(Hint: Table1)

Best, Kevin

Ramkyeri commented 1 year ago

Dear Kevin, Many thanks for your kind reply. I also found some answers. AnnoSINE finds SINEs, and in that tool SineFinder showed an error for larger genomes.

They suggested this "Modified SINEFinder.py to run in 'chunkwise' instead of 'seqwise' mode. Script attached"

I checked the script also

If you have time, please go through this link,

https://github.com/yangli557/AnnoSINE/issues/2

Thank you again for your kind words. I am learning a lot from you. Your tool is really great, you have integrated many tools, it's not easy.

with regards Ramkyeri SINEFinder2.zip

DerKevinRiehl commented 1 year ago

Dear Ramky, thanks for your answer.

Looking forward to hearing from you whether this did the trick and you could apply sinefinder.

Best, Kevin

Ramkyeri commented 1 year ago

Dear Kevin, Thanks for your kind reply. I will update you on the results. with regards Ramky

Ramkyeri commented 1 year ago

Dear Kevin, I tried this 'SINEFinder2.py' with my sequence file and it called more than 1 sequence; however, I stopped it after calling 100 sequences because I do not know the runtime. Then I tried it with your demo sequences, but 'SINEFinder2.py' did not call the reverse complement sequence 'sequence_rc.fasta'. When I copied the reverse complementary sequence to the same folder, it also worked. Your Python script, on the other hand, calls both strands and stores the results in two folders, sinefind (for the complementary sequence) and sinefind_rc (for the reverse complement sequence). I compared the results of the both Python scripts, and both Python scripts showed similar results for the complementary sequence and for the reverse complementary sequence.

However, I also found that in your demo results file, the results might have been copied twice by accident. But I am not sure about this.

This was from SINEFinder2.py

seq1 F 692852:692997 TSD-len=11;TSD-score=10;TSD-mism=1 CATCATGGATGattgttcttgcacaaatagtatggcAATGGagcaatggatgtctgagtattgtatacttaaccctcaatcgaaccGTTCGAaatgcagtcatcaaaatgctgattccAAAAAAgatccgactaCGTCATGGATG seq2 F 355843:356361 TSD-len=11;TSD-score=10;TSD-mism=1 GAATTGTGAAAattatgctgAATGGgaaaagaagtggcttgtaagatgtaaGTTCAAgttattataattaatcatagatatcattacaaaataactgtctaaaagtttcagatactatgataaatgcccaaggaactggtactgaaacacttttgcaaattacttgtagttgtaagaccttctactctgtcaaactgccgaataagtttgcttcttgccctggatgcaaaatcaatataaacacggagaccatgaactgtgtttattatccatactattatccatatcccgcaaggtatcgcgagcaacagagaagctatgaccaatctttaaagtatgccgataatggaccttattgtccagttgctacagtctatactgatatccgcttgattcctgcaatcaaagattcggtaatgagaatatgtgcacggacaagagatctacgacttgacatcaaacttcggaatcgTTTTTTgaaggctgaacaattattgattaaaaatgGAATTCTGAAG seq2 F 405721:405846 TSD-len=15;TSD-score=12;TSD-mism=3 TGGATTATTAAGTTTtcAATGGaaaaatgtctgaaaaatttacaaatcatGTTCAAaatcaataatcccaataaggttatccgAAAAAAAAAAcgcggaaatgttgaagcTAGGTTATTAATTTT seq2 F 487464:487588 TSD-len=13;TSD-score=11;TSD-mism=2 TTCAAAAGTCTACtcggaaGCTGGtggacgtgttgatgtgatgcttacggctatcgaatcggcGTTCAAttcctattgggatccttttgaggtttgttgattAAAAAAAtaTACAAAAATCTAC

This was from your script

seq1 F 692852:692997 TSD-len=11;TSD-score=10;TSD-mism=1 CATCATGGATGattgttcttgcacaaatagtatggcAATGGagcaatggatgtctgagtattgtatacttaaccctcaatcgaaccGTTCGAaatgcagtcatcaaaatgctgattccAAAAAAgatccgactaCGTCATGGATG seq2 F 355843:356361 TSD-len=11;TSD-score=10;TSD-mism=1 GAATTGTGAAAattatgctgAATGGgaaaagaagtggcttgtaagatgtaaGTTCAAgttattataattaatcatagatatcattacaaaataactgtctaaaagtttcagatactatgataaatgcccaaggaactggtactgaaacacttttgcaaattacttgtagttgtaagaccttctactctgtcaaactgccgaataagtttgcttcttgccctggatgcaaaatcaatataaacacggagaccatgaactgtgtttattatccatactattatccatatcccgcaaggtatcgcgagcaacagagaagctatgaccaatctttaaagtatgccgataatggaccttattgtccagttgctacagtctatactgatatccgcttgattcctgcaatcaaagattcggtaatgagaatatgtgcacggacaagagatctacgacttgacatcaaacttcggaatcgTTTTTTgaaggctgaacaattattgattaaaaatgGAATTCTGAAG seq2 F 405721:405846 TSD-len=15;TSD-score=12;TSD-mism=3 TGGATTATTAAGTTTtcAATGGaaaaatgtctgaaaaatttacaaatcatGTTCAAaatcaataatcccaataaggttatccgAAAAAAAAAAcgcggaaatgttgaagcTAGGTTATTAATTTT seq2 F 487464:487588 TSD-len=13;TSD-score=11;TSD-mism=2 TTCAAAAGTCTACtcggaaGCTGGtggacgtgttgatgtgatgcttacggctatcgaatcggcGTTCAAttcctattgggatccttttgaggtttgttgattAAAAAAAtaTACAAAAATCTAC

the results might be copied two times

>seq1 F 692852:692997 TSD-len=11;TSD-score=10;TSD-mism=1 CATCATGGATGattgttcttgcacaaatagtatggcAATGGagcaatggatgtctgagtattgtatacttaaccctcaatcgaaccGTTCGAaatgcagtcatcaaaatgctgattccAAAAAAgatccgactaCGTCATGGATG >seq2 F 355843:356361 TSD-len=11;TSD-score=10;TSD-mism=1 GAATTGTGAAAattatgctgAATGGgaaaagaagtggcttgtaagatgtaaGTTCAAgttattataattaatcatagatatcattacaaaataactgtctaaaagtttcagatactatgataaatgcccaaggaactggtactgaaacacttttgcaaattacttgtagttgtaagaccttctactctgtcaaactgccgaataagtttgcttcttgccctggatgcaaaatcaatataaacacggagaccatgaactgtgtttattatccatactattatccatatcccgcaaggtatcgcgagcaacagagaagctatgaccaatctttaaagtatgccgataatggaccttattgtccagttgctacagtctatactgatatccgcttgattcctgcaatcaaagattcggtaatgagaatatgtgcacggacaagagatctacgacttgacatcaaacttcggaatcgTTTTTTgaaggctgaacaattattgattaaaaatgGAATTCTGAAG >seq2 F 405721:405846 TSD-len=15;TSD-score=12;TSD-mism=3 TGGATTATTAAGTTTtcAATGGaaaaatgtctgaaaaatttacaaatcatGTTCAAaatcaataatcccaataaggttatccgAAAAAAAAAAcgcggaaatgttgaagcTAGGTTATTAATTTT >seq2 F 487464:487588 TSD-len=13;TSD-score=11;TSD-mism=2 TTCAAAAGTCTACtcggaaGCTGGtggacgtgttgatgtgatgcttacggctatcgaatcggcGTTCAAttcctattgggatccttttgaggtttgttgattAAAAAAAtaTACAAAAATCTAC****

Thank you in advance for your kindness, please give your suggestion

with regards Ramkyeri
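
Comparing two result files by eye is error-prone, so here is a minimal Python sketch of the comparison described in the comment above. It assumes each hit header has the layout quoted there (`>name strand start:end TSD-info`) and identifies a hit by sequence name, strand and coordinates. All function names are hypothetical; this is not code from reasonaTE or SINE-Finder.

```python
from collections import Counter


def parse_hit_header(line):
    """Parse a header such as
    '>seq1 F 692852:692997 TSD-len=11;TSD-score=10;TSD-mism=1'
    into (sequence_name, strand, start, end)."""
    fields = line.lstrip(">").split()
    name, strand, coords = fields[0], fields[1], fields[2]
    start, end = (int(value) for value in coords.split(":"))
    return name, strand, start, end


def hit_keys(lines):
    """Collect one key per header line; sequence lines are ignored."""
    return [parse_hit_header(line) for line in lines if line.startswith(">")]


def duplicated_hits(keys):
    """Return the hits that occur more than once."""
    return sorted(key for key, count in Counter(keys).items() if count > 1)


def compare_hits(keys_a, keys_b):
    """Return (hits only in A, hits only in B)."""
    set_a, set_b = set(keys_a), set(keys_b)
    return sorted(set_a - set_b), sorted(set_b - set_a)


# Headers taken from the results quoted in this thread.
headers = [
    ">seq1 F 692852:692997 TSD-len=11;TSD-score=10;TSD-mism=1",
    ">seq2 F 355843:356361 TSD-len=11;TSD-score=10;TSD-mism=1",
    ">seq2 F 405721:405846 TSD-len=15;TSD-score=12;TSD-mism=3",
    ">seq2 F 487464:487588 TSD-len=13;TSD-score=11;TSD-mism=2",
]
keys = hit_keys(headers)
print(compare_hits(keys, keys))
print(duplicated_hits(keys + keys))
```

If a results file really contains every hit twice, `duplicated_hits` lists all of them; an empty list means no hit is repeated.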

Ramkyeri commented 1 year ago

Dear Kevin, The original names of my sequences start with >moso_draft_hic_scaffold_1, which your tool replaces with seq1. I checked the demo GFF file, where the name is seq1, and your original file also uses seq1. What will the sequence names be in the case of my file? I do not know when my analysis will be finished.

Thank you in advance for your kindness, please give your suggestion

with regards Ramkyeri

Ramkyeri commented 1 year ago

Dear Kevin,

The mitefind status shows completed, but in both folders, mitefind and mitefind_rc, the result files are 0 KB.

I do not know the reason.

Thank you in advance for your kindness, please give your suggestion

with regards Ramkyeri

DerKevinRiehl commented 1 year ago

Dear Ramky, the renaming of the sequences is stored in sequence_heads.txt, so you can match which internal name corresponds to which original sequence.

If mitefind results in empty files, maybe it is because mitefinder didn't find anything in your bamboo genome, I am afraid...
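
To see at a glance which tool folders ended up with empty outputs, a short Python sketch can walk the project folder and list every zero-byte file. The function name `empty_files` and the file names in the demonstration are hypothetical. An empty file alone does not tell whether a tool found nothing or stopped early, so the tool's log output still needs to be checked.

```python
import os
import tempfile


def empty_files(folder):
    """Return the relative paths of all zero-byte files below `folder`."""
    found = []
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            if os.path.getsize(path) == 0:
                found.append(os.path.relpath(path, folder))
    return sorted(found)


# Demonstration on a temporary folder; 'result.txt' is a made-up file name.
with tempfile.TemporaryDirectory() as project:
    for tool in ("mitefind", "mitefind_rc", "ltrHarvest"):
        os.makedirs(os.path.join(project, tool))
    open(os.path.join(project, "mitefind", "result.txt"), "w").close()
    open(os.path.join(project, "mitefind_rc", "result.txt"), "w").close()
    with open(os.path.join(project, "ltrHarvest", "result.txt"), "w") as fh:
        fh.write("placeholder\n")
    demo_empty = empty_files(project)
    print(demo_empty)
```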

Best, Kevin

Ramkyeri commented 1 year ago

Dear Kevin, Many thanks for your kind reply. I will update you on the results. with regards Ramky

DerKevinRiehl / transposon_annotation_tools

> parse RepeatMasker in Docker container #8
