DerKevinRiehl / TransposonUltimate

TransposonUltimate - a holistic set of tools for transposon identification
GNU General Public License v3.0

With low-threaded annotation work, can I split a large whole genome with 20 chromosomes? #10

Closed Djangodu closed 1 year ago

Djangodu commented 1 year ago

Dear Kevin

First, I must say that your work is brilliant. Over the past few days I have tried my best to set up the environment for it, and I installed everything by following each step of your guide.

However, there are some errors I can't resolve.

After installing the software, I downloaded the sequence.fasta you suggested and ran the transposon_annotation_reasonaTE protocol. Everything went smoothly until Step 4, where two errors occurred. I don't know what caused them or how to deal with them.

For the first error, I suspect the output files are not in the right format, but I don't know why, since I followed every step the instructions recommend.

For the second error, I compared your testProject files with mine. I found just two files in yours:

PipelineAnnotations_TransposonSequencesClasses.txt
ToolAnnotations_TransposonSequencesClasses.txt

but three temp files in mine:

ToolAnnotations_TransposonSequencesClasses.txt.featuresA_20230827_134213.temp
ToolAnnotations_TransposonSequencesClasses.txt.featuresA_20230827_144626.temp
ToolAnnotations_TransposonSequencesClasses.txt.featuresB_20230827_134213.temp

These are the two errors:

ValueError: node array from the pickle has an incompatible dtype:

Step 4) Run the pipeline on the genome annotations


conda activate transposon_annotation_reasonaTE
reasonaTE -mode pipeline -projectFolder workspace -projectName testProject

### ### ### ### ### ### ### ### ### ### ### ### ### ### ### ### ### ### ### ### ### ### ### ### ### ### ### 
### ### ### ### ### ### ### ### ### ### ### ### ### ### ### ### ### ### ### ### ### ### ### ### ### ### ### 
### ### ### ### ### ### ### ### ### ### ### ### ### ### ### ### ### ### ### ### ### ### ### ### ### ### ### 

After the second error, the step did not continue and the process was killed.

I assume the remaining steps cannot run until these two errors are fixed. Could you please take a look and help me with them?

Looking forward to your reply
Sincerely yours
DerKevinRiehl commented 1 year ago

Dear Djangodu, thank you very much for your interest in our work. Could you share how you set up your environment and which Python version you use?

Thank you very much, Best, Kevin

Djangodu commented 1 year ago

Dear Kevin

Thanks for your reply. I set up the environment as recommended in the transposon_annotation_reasonaTE README.md, using mamba and conda to install it. The Python version is 2.7.

Looking forward to your reply Sincerely yours

Djangodu commented 1 year ago

Also, my results show some tracebacks, as you can see in the picture I posted. Is that normal at this step?

Djangodu commented 1 year ago

Dear Kevin

I used your xml file to successfully create the environment. Thank you very much for providing it.

I have another question, though. My genome is very large, at least 3.0 GB, and I would like to annotate it with your tool. Would it be a good idea to split it into single chromosomes? Some of the tools in this annotation workflow would take a very long time on a genome this big.

However, I am not sure whether splitting is appropriate for whole-genome annotation. I know that classification runs after the annotation tools have finished, but if splitting the whole genome into twenty pieces produces twenty results, how do I combine them afterwards?

Is this the right thing to do, or do I have to annotate the whole genome without splitting it?

If I have to annotate the whole genome at once, is there a way to use more than 1 or 2 threads? With so few threads, a big genome is processed very slowly.

Looking forward to your reply Sincerely yours

DerKevinRiehl commented 1 year ago

Hello Djangodu, I think that, depending on your computational capacity, it makes sense to split up the genome and analyse the parts separately. For this you need to create separate projects, as one project can only contain one sequence.fasta file.

This is not a problem at all, as long as you split the genome by chromosome rather than at random positions. If you cut the chromosomes themselves into pieces, the results would still be valid, but you would risk missing transposons at the cut sites.
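The per-chromosome split described above could be sketched as follows. This is a minimal sketch, not part of TransposonUltimate: the function name `split_fasta_by_record` and the output naming scheme (`<record name>.fasta`) are assumptions made for illustration. It cuts only at FASTA header lines, so no chromosome is split internally.

```python
import re
from pathlib import Path


def split_fasta_by_record(fasta_path, out_dir):
    """Write every FASTA record (e.g. one chromosome) to its own file.

    Hypothetical helper: files are named after the first word of each
    header line. Returns the list of written paths in input order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    handle = None
    try:
        with open(fasta_path) as src:
            for line in src:
                if line.startswith(">"):
                    # A header starts a new record: close the previous file.
                    if handle is not None:
                        handle.close()
                    parts = line[1:].split()
                    name = parts[0] if parts else f"record{len(written) + 1}"
                    # Keep file names safe (no slashes or odd characters).
                    name = re.sub(r"[^A-Za-z0-9._-]", "_", name)
                    target = out_dir / f"{name}.fasta"
                    handle = open(target, "w")
                    written.append(target)
                if handle is not None:
                    # Header and sequence lines are copied unchanged.
                    handle.write(line)
    finally:
        if handle is not None:
            handle.close()
    return written
```

Calling `split_fasta_by_record("genome.fasta", "chromosomes")` (both names are placeholders) would leave one FASTA file per chromosome, each of which can then serve as the input sequence for its own project. Records that share the same header name would overwrite each other in this sketch.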

You could try to run the separate projects in parallel in different threads on your Linux machine, and that should be totally fine.
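One way to launch several projects side by side is sketched below. It reuses the Step 4 command quoted earlier in this thread (`reasonaTE -mode pipeline -projectFolder ... -projectName ...`). The per-chromosome project names, the `max_parallel` setting, and the helper names are assumptions for illustration. The sketch also assumes that the conda environment is already active and that each project has been created and annotated beforehand.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def pipeline_command(project_folder, project_name):
    """Build the Step 4 command line for one project."""
    return [
        "reasonaTE", "-mode", "pipeline",
        "-projectFolder", project_folder,
        "-projectName", project_name,
    ]


def run_projects(project_names, project_folder="workspace",
                 max_parallel=4, runner=subprocess.run):
    """Run one pipeline process per project, at most max_parallel at once.

    Each worker thread only waits for its external process, so threads
    are sufficient here. `runner` can be replaced for a dry run.
    """
    commands = [pipeline_command(project_folder, name)
                for name in project_names]
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        results = list(pool.map(runner, commands))
    return dict(zip(project_names, results))
```

For a dry run that only shows the commands, `run_projects(["chr1", "chr2"], runner=lambda cmd: cmd)` returns the command lists without starting anything. With the default runner, each value is the `subprocess.CompletedProcess` of that project, whose `returncode` can be checked.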

After everything has finished, you can use common bioinformatics tools to combine the separate annotation files into a whole again. If you are not able to write a program or use bioinformatics tools, you can also just open the annotation files in a text editor such as "Notepad++" and concatenate them yourself. You just need to rename the sequence to the name of the respective chromosome before putting everything together: replace "seq1" with the name of the chromosome, and that's it.

seq1    reasonaTE   transposon  29700   29860   .   +   .   transposon=251;class=2/1/2(hAT,TIR,DNATransposon)
seq1    reasonaTE   transposon  34769   34942   .   +   .   transposon=252;class=2/1/2(hAT,TIR,DNATransposon)
seq1    reasonaTE   transposon  86291   87619   .   +   .   transposon=253;class=2/1/5(Zator,TIR,DNATransposon)
seq1    reasonaTE   transposon  88210   89547   .   +   .   transposon=254;class=2/1/5(Zator,TIR,DNATransposon)
seq1    reasonaTE   transposon  178154  182145  .   +   .   transposon=255;class=2/1/5(Zator,TIR,DNATransposon)
seq1    reasonaTE   transposon  199518  199683  .   +   .   transposon=256;class=2/1/2(hAT,TIR,DNATransposon)
seq1    reasonaTE   transposon  200395  200562  .   +   .   transposon=257;class=2/1/4(Sola,TIR,DNATransposon)
seq1    reasonaTE   transposon  214649  214819  .   +   .   transposon=258;class=2/1/3(CMC,TIR,DNATransposon)
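The rename-and-concatenate step described above could also be scripted. The following is a minimal sketch under stated assumptions: the function name `merge_annotations` and the file paths are hypothetical, and it assumes that the first column of every annotation line is the placeholder `seq1`, as in the excerpt above.

```python
import re


def merge_annotations(per_chromosome_files, merged_path, placeholder="seq1"):
    """Concatenate annotation files, renaming the sequence column.

    per_chromosome_files maps a chromosome name to the annotation file
    of the project that analysed it. Returns the number of lines written.
    """
    # Match the placeholder only as the first field of a line.
    pattern = re.compile(r"^" + re.escape(placeholder) + r"(?=\s)")
    count = 0
    with open(merged_path, "w") as out:
        for chromosome, path in per_chromosome_files.items():
            with open(path) as src:
                for line in src:
                    if not line.strip():
                        continue  # skip empty lines
                    # A function replacement inserts the name literally.
                    renamed = pattern.sub(lambda _m: chromosome, line)
                    out.write(renamed.rstrip("\n") + "\n")
                    count += 1
    return count
```

A call such as `merge_annotations({"chr1": "chr1_annotations.gff3", "chr2": "chr2_annotations.gff3"}, "genome_annotations.gff3")` (all file names are placeholders) writes one combined file. The `transposon=...` numbers are left unchanged, so they may repeat across chromosomes if each project numbers its transposons independently.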

I hope that helps a little. Just be aware that 3 GB is a lot, and you had better look for a cluster to run this on. Please keep me updated in case you make progress or need any help. Best regards, Kevin

Djangodu commented 1 year ago

Dear Kevin

Thank you very much for your patient answers to my questions. These suggestions are very important to me.

Best wishes, Sincerely yours