Dear Nicolas, thank you for your feedback again.
#Part1
By trying to install separately the different tools, I have noticed that mustv2 needed blast-legacy to be installed to work properly, and sinescan needed to be installed with python 2.7.
I will work on it and come back to you soon (see #Part5). I just have the following questions for you: 1) Was this only related to mustv2 and sinescan, or did you have issues with any other package during installation? 2) After installing, did you use the demo.fasta from this GitHub repository (https://github.com/DerKevinRiehl/transposon_annotation_tools)?
#Part2
I have installed and run the different tools independently on your demo dataset but I haven't got any results from transposonPSI and proteinNCBICDD1000 which make fail the command:
reasonaTE -mode pipeline -projectFolder workspace -projectName testProject
Well, the error message shows that there is no file "transposonPSI.gff3" in "parsedAnnotations". So, according to the tutorial (https://github.com/DerKevinRiehl/transposon_annotation_reasonaTE), you managed to do Step 1), create a project. Then you did Step 2), annotate the genome with annotation tools. How exactly did you do this? (I mention four options for how to do it.) Which option did you choose?
#Part3
What output do you get when running "Check status of annotation tools"?
conda activate transposon_annotation_tools_env
reasonaTE -mode checkAnnotations -projectFolder workspace -projectName testProject
#Part4
Did you try to install the annotation tools separately because the installation outlined in the tutorial of reasonaTE (https://github.com/DerKevinRiehl/transposon_annotation_reasonaTE) didn't work, or because you just wanted to use them separately?
#Part5 - Sinescan Updates
I updated sinescan. Even though I declare the dependency on python=2.7 in sinescan, conda selected older versions (without explicit mention of python=2.7), as it tries to avoid environment solving effort. If I force it to use the sinescan package with the python=2.7 dependency, it takes forever to solve the environment. As it turns out, conda has trouble solving the environment due to large package dependencies. Therefore, I recommend using mamba to install sinescan. Please have a look at the tutorial page again: https://github.com/DerKevinRiehl/transposon_annotation_tools
conda install -y mamba
mamba install -y python=2.7 # if not done before or mentioned while creating the environment
mamba install -y -c derkevinriehl transposon_annotation_tools_sinescan=1.1.2
#Part5 - Mustv2
I cannot reproduce your error. If I use a plain, clean Ubuntu system and try to install the mustv2 conda package as it is, it works fine for me. Could you please create a new environment and try again? If it fails again, could you provide the "conda list" output, so that I can see your packages? (Please also provide the conda list for your base environment.) I will ask some colleagues to reproduce this error.
Your user feedback and issues are highly appreciated to improve my software.
Thank you very much, I am looking forward to your answers.
Best regards, Kevin
Dear Kevin
I have installed the tools in different environments: helitronScanner, genometools (for ltrHarvest and tirvish), mitefinder, mitetracker, mustv2, repeatmodeler, repeatmasker, sinefinder, sinescan, transposonpsi, proteinncbicdd1000 and reasonate. First I activated reasonate and ran the commands to create the working directory:
mkdir workspace
wget https://raw.githubusercontent.com/DerKevinRiehl/transposon_annotation_reasonaTE/main/workspace/testProject/sequence.fasta # demo fasta you could use
reasonaTE -mode createProject -projectFolder workspace -projectName testProject -inputFasta sequence.fasta
But then I ran each tool separately, without using the reasonaTE command. So for example I did:
cd workspace/testProject/tirvish
gt suffixerator -db ../sequence.fasta -indexname sequence.index -tis -suf -lcp -des -ssp -sds -dna -mirrored
gt tirvish -index sequence.index > result.txt
and so on for all the tools.
reasonaTE -mode checkAnnotations -projectFolder workspace -projectName testProject
Checking helitronScanner ... completed
Checking ltrHarvest ... completed
Checking ltrPred ... not completed
Checking mitefind ... completed
Checking mitetracker ... completed
Checking must ... completed
Checking repeatmodel ... not completed
Checking repMasker ... completed
Checking sinefind ... not completed
Checking sinescan ... completed
Checking tirvish ... completed
Checking transposonPSI ... not completed
Checking NCBICDD1000 ... not completed
I know that some tools show "not completed". I ran all of them except ltrPred, which I couldn't manage to install, but they gave me no output.
It is by choice that I have installed the programs separately. I have also tried to install them from the environment file and there was no problem; it is just that in the future I plan to use one tool at a time, probably in a snakemake pipeline, and it is better for me to have them separately than all in one environment. For example, when I used the environment file, sinescan and mustv2 did not cause trouble, because your environment asks for python 2.7 and blast-legacy. But when I try to install mustv2 alone in its own environment, blast-legacy is not among the dependencies (tried now on another system):
conda create -n mustv2 -c derkevinriehl transposon_annotation_tools_mustv2
...
environment location: /home/niko/anaconda3/envs/mustv2
added / updated specs:
The following packages will be downloaded:
package | build
---------------------------|-----------------
blast-2.11.0 | pl526he19e7b1_0 20.8 MB bioconda
blat-36 | 0 699 KB bioconda
c-ares-1.17.1 | h7f98852_1 109 KB conda-forge
cairo-1.16.0 | h6cf1ce9_1008 1.5 MB conda-forge
curl-7.76.1 | h979ede3_1 149 KB conda-forge
entrez-direct-13.9 | pl5262he881be0_2 5.2 MB bioconda
expat-2.3.0 | h9c3ff4c_0 168 KB conda-forge
fontconfig-2.13.1 | hba837de_1005 357 KB conda-forge
gdk-pixbuf-2.42.6 | h04a7f16_0 609 KB conda-forge
graphviz-2.47.1 | hebd9034_0 4.0 MB conda-forge
Total: 127.6 MB
blast-legacy is not present, which causes mustv2 to fail.
conda activate mustv2
cd must
mkdir temp
mustv2 ../sequence.fasta result.txt temp
Spliting the genome into 10 sub-genomes ... [SubFiles:10] [done] [Elapsed time: 0 seconds]
Scanning the nucleotide sequences for potential MITEs ...
Total calculation time elapsed: 26 seconds.
Loading the annotation data ... [RawMITEs:2480] [done] [Time:0 seconds]
Removing redundancy in the predicted MITEs ... [RuleBasedRemoving:10] [MITE:2470] [done] [Time:0 seconds]
Clustering ... sh: 1: formatdb: not found
sh: 1: megablast: not found
[BLAST]EE! ---[temp/temp-mite-seq.all-vs-all.megablast.txt]---
Similarly, for sinescan the problem was the need to run with python 2.7. I am trying to reinstall it now, but it is quite slow.
Thank you for your fast answers
Regards Nicolas
I have tried reinstalling sinescan and it installs with python 2.7, so it is fine now.
cd sinescan/
mkdir result
mkdir output
mkdir final
sinescan -s 123 -g ../sequence.fasta -o output -d result -z final
perl /home/niko/anaconda3/envs/sinescan/bin/SINE_Scan-v1.1.1/SINE_Scan_process.pl -s 123 -g ../sequence.fasta -o output -d result -z final
result exists. clean it.
Step One: Run SINE-Finder.
zhanjing zhanjing
Finish Step One.
Time cost for SINE-Finder Module One:55 wallclock secs ( 0.01 usr 0.00 sys + 93.12 cusr 0.19 csys = 93.32 CPU)
Build BlAST database for the genomic sequences.
Building a new DB, current time: 04/27/2021 09:24:08
New DB name: /home/niko/test/sinescan/result/sequence.fasta
New DB title: result/sequence.fasta
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 2 sequences in 0.0242441 seconds.
Your genomic dataset has no reasonable SINE candidates
Dear Nicolas,
# Update MustV2 package
Okay, I have now updated the Mustv2 package with the blast-legacy dependency. It would be great if you could reinstall it and check whether it works this time.
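A quick way to verify the reinstall is to check that the two executables named in the mustv2 log (`formatdb` and `megablast`) are now on the PATH of the active environment. The sketch below is only an illustration and not part of mustv2 or reasonaTE; the helper name `missing_executables` is hypothetical.

```python
import shutil

# Executables that the mustv2 log in this thread reports as "not found".
LEGACY_BLAST_TOOLS = ["formatdb", "megablast"]


def missing_executables(names, which=shutil.which):
    """Return the names that cannot be found on the current PATH."""
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    missing = missing_executables(LEGACY_BLAST_TOOLS)
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All legacy BLAST executables found.")
```

Run it inside the activated mustv2 environment; an empty list means both legacy BLAST programs are visible to the shell that mustv2 spawns.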
# Part 2:
but then I have run separately each tools but not using the command reasonaTE. So for exemple I did:
cd workspace/testProject/tirvish
gt suffixerator -db ../sequence.fasta -indexname sequence.index -tis -suf -lcp -des -ssp -sds -dna -mirrored
gt tirvish -index sequence.index > result.txt
and so on for all the tools.
Okay. This way of using the software refers to "Option 4" in Step 2 of the reasonaTE tutorial: https://github.com/DerKevinRiehl/transposon_annotation_reasonaTE
I think the tricky thing here is that you need to run the tools and make sure their output goes to the same folder with the same file name as in the demo folder of the reasonaTE project: https://github.com/DerKevinRiehl/transposon_annotation_reasonaTE/tree/main/workspace/testProject
I remember you wanted to run the tools one by one separately to determine specific parameters; for this purpose (see your last GitHub issue) I updated reasonaTE. reasonaTE calls the tools and takes care that the output files are stored in the right folders with the right names, so that other parts of the pipeline can find them properly. Therefore I still recommend using reasonaTE to call the tools if you want to use the tools in combination with reasonaTE. I now offer four options to call the tools^^ (see tutorial).
However, as I think you are an advanced / expert level user of all these software tools anyway, one thing I can share with you is the following: if you want to run the tools completely by yourself and make sure the output files land in the right folder and have the right name, I recommend that you check how reasonaTE calls them via the CLI. You can find this information in this script: https://github.com/DerKevinRiehl/transposon_annotation_reasonaTE/blob/main/Code/AnnotationCommander.py
# Part 3:
Yes, if some are "not completed" that's totally fine. It's just that the protein annotations from transposonPSI and NCBICDD1000 are mandatory. According to your output they are not found. Could you maybe outline in more detail how you ran transposonPSI or NCBICDD1000 to annotate transposon characteristic proteins? Maybe you could also show me the output files in the folders? (You said there are no output files at all; maybe there are, just wrongly named? Is there any error message? Normally the tools should generate output files...) It would be very helpful to get some information from you on this!
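To make the "mandatory" rule concrete, the following sketch parses the text printed by `reasonaTE -mode checkAnnotations` (in the `Checking <tool> ... completed` form shown in this thread) and reports which of the two mandatory protein annotations are still missing. It is an illustrative helper written for this discussion, not code from reasonaTE; the function names are hypothetical.

```python
import re

# The two annotations described above as mandatory.
MANDATORY_TOOLS = ("transposonPSI", "NCBICDD1000")

# Matches entries such as "Checking must ... completed" or
# "Checking ltrPred ... not completed", one per line or all on one line.
STATUS_PATTERN = re.compile(r"Checking (\S+) \.\.\. (not completed|completed)")


def parse_check_output(text):
    """Map each tool name to True (completed) or False (not completed)."""
    return {tool: status == "completed"
            for tool, status in STATUS_PATTERN.findall(text)}


def missing_mandatory(text, mandatory=MANDATORY_TOOLS):
    """Return the mandatory tools that are absent or not completed."""
    statuses = parse_check_output(text)
    return [tool for tool in mandatory if not statuses.get(tool, False)]
```

Feeding it the status output posted earlier in the thread would flag both transposonPSI and NCBICDD1000, which matches the failure of the pipeline mode.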
#Part4
It is by choice that I have installed the program separately. But I have also tried to install them from the environment file, there was no problem, it is just in the future I plan to use one tool at a time probably in a snakemake pipeline, and it is better for me to have them separately than all in one environment.
I get your point. As some software tools are quite slow, it makes sense to run them in parallel using, for example, a snakemake pipeline. I am well aware the environment is quite large, but don't you think you could also call the tools in snakemake using one environment and the reasonaTE command? Also, I found that mamba is a valuable addition, as it can install packages much faster than conda does.
Slow sinescan
similarly, for sinescan the problem was the need to run with python 2.7, I am trying now to reinstall it but it is quite slow
Yes. Luckily you found few to no SINEs, as they rarely exist in C. elegans^^. I agree, sinescan is one of the slower tools. You have to consider that I use other people's software; I didn't write it or optimize its resource consumption and speed. If you apply these tools to very large FASTA files you will find that especially RepMasker and RepModeler take most of the time. Therefore, once again, I recommend that you parallelize using snakemake rather than calling the tools one by one. Also, there is no need to run all tools; it's just that the more tools you run, the more knowledge from different tools is combined.
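As a rough illustration of running the tools side by side, here is a minimal sketch that uses Python's standard thread pool instead of snakemake (a deliberate substitution to keep the example self-contained). It builds one `reasonaTE -mode annotate` command per tool, using the tool names that appear in this thread. Whether reasonaTE tolerates several annotate runs in the same project folder at once is an assumption that would need to be checked; the function names are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Tool names as they appear in the reasonaTE commands and status output in this thread.
TOOLS = ["helitronScanner", "ltrHarvest", "mitefind", "mitetracker", "must",
         "repeatmodel", "repMasker", "sinefind", "sinescan", "tirvish",
         "transposonPSI", "NCBICDD1000"]


def annotate_command(tool, project_folder="workspace", project_name="testProject"):
    """Build the argument list for one 'reasonaTE -mode annotate' call."""
    return ["reasonaTE", "-mode", "annotate",
            "-projectFolder", project_folder,
            "-projectName", project_name,
            "-tool", tool]


def run_all(tools, max_workers=4, runner=subprocess.run, **project):
    """Run one annotate command per tool, at most max_workers at a time."""
    commands = [annotate_command(tool, **project) for tool in tools]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps the results in the same order as the tools.
        return list(pool.map(runner, commands))
```

Calling `run_all(TOOLS)` inside the activated environment would start the annotations; passing a different `runner` allows a dry run that only prints or records the commands.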
Update of Tutorial for reasonaTE
I updated the tutorial on how to install reasonaTE using mamba. Feel free to check it out; installation should be much faster.
Best regards and looking forward to hearing back from you soon, Kevin
Mustv2 is fixed, thank you.
I am just testing so that I can still decide how to include the tools in a snakemake pipeline. For example, I have retried running the different tools using the reasonaTE command in two ways: with the all option, and separately for each tool. That way it works fine, except that RepeatMasker gives no results, even with the following command:
reasonaTE -mode annotate -projectFolder workspace -projectName testProject3 -tool repMasker xxxxx -lib worm_repDB.fasta
RepeatMasker version 4.1.1
Search Engine: NCBI/RMBLAST [ 2.10.0+ ]
Using Master RepeatMasker Database: /mnt/ssd/ssd_1/conda_envs/nicolas_reasonate_all/share/RepeatMasker/Libraries/RepeatMaskerLib.h5 Title : Version : Date : Families :
Species "homo sapiens" is not known to RepeatMasker. There may not be any TE families defined in the libraries for this species/clade or there may be an error in the spelling. Please check your entry against the NCBI Taxonomy database and/or try using a broader clade or related species instead. The full list of species/clades defined in the library may be obtained using the famdb.py script.
Annotation by software repMasker finished successfully...
It looks like in this case the "xxxxx -lib worm_repDB.fasta" part was not parsed and passed on to RepeatMasker, so we fall back to the problem that RepeatMasker needs to be configured before use.
About proteinNCBICDD1000, I did:
cd /workspace/testProject/NCBICDD1000
mkdir result
proteinNCBICDD1000 -fastaFile ../sequence.fasta -resultFolder result
But I just found out that naming the resultFolder "temp" fixes the check, so now:
Checking NCBICDD1000 ... completed
About transposonPSI, I did:
cd /workspace/testProject/transposonPSI
mkdir temp
mkdir result
transposonPSI -fastaFile ../sequence.fasta -resultFolder result -tempFolder temp -mode nuc
The outputs are in result and are not found when running the check.
But if I do:
mkdir temp
transposonPSI -fastaFile ../sequence.fasta -resultFolder ./ -tempFolder temp -mode nuc
it fixes the problem:
Checking transposonPSI ... completed
I have a comment regarding the fact that for a few tools you copy sequence.fasta into the folder of the tool (sinefind, repMasker). I think it would be better to create a symbolic link instead, because with a large genome, copying the file several times could take a lot of space. Also, I would like to know why, when we create the project, you create a fasta file with renamed sequences.
Thanks for creating those packages
Nicolas
After running RepeatMasker independently, I managed to finish the reasonaTE run, but I can find gff3 files in the transposonCandA folder which do not have renamed sequence names. Where could I find a final gff file with the correct sequence names?
Dear Nicolas, thanks once again for your feedback and suggestions.
#Calling RepeatMasker and RepeatModeler
Could you share the command line that you used to run RepeatMasker independently (same conda environment or another)? I just checked the code again: if you write something after the five xxxxx, it should definitely be used when calling the CLI (command line interface). I recommend that users install RepeatModeler onto the system, so that it is not specific to a conda environment. RepeatModeler and RepeatMasker installations in the context of conda are known to cause problems (see many examples in forums on the internet).
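For readers unfamiliar with the `xxxxx` convention, the idea is that everything before the marker belongs to reasonaTE and everything after it is handed to the underlying tool. The sketch below shows one way such a split can be done; it is a hypothetical illustration of the convention and not the actual parsing code of reasonaTE.

```python
def split_tool_arguments(argv, sentinel="xxxxx"):
    """Split an argument list at the first sentinel.

    Returns (own_args, passthrough_args). Without a sentinel, all
    arguments are treated as own arguments and nothing is passed through.
    """
    if sentinel in argv:
        index = argv.index(sentinel)
        return list(argv[:index]), list(argv[index + 1:])
    return list(argv), []
```

With the command discussed in this thread, the pass-through part would be `["-lib", "worm_repDB.fasta"]`, which is what RepeatMasker would need to receive for the custom library to take effect.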
#Copy files
I have a comment regarding the fact that for few tools you are copying the sequence.fasta in the folder of the tool (sinefind, repMasker). I think that it would be better that you create symbolic link instead because with large genome, to copy several time the file could take a lot of space.
So in fact some tools need to have the sequence file in the current working directory when running; therefore reasonaTE sometimes copies it. Also, some tools only annotate one strand; therefore the reverse complement is created and annotated explicitly by some tools as well.
#Fasta files with renamed sequences
Also I would like to know why when we create the project, you create a fasta file with renamed sequences.
So according to the FASTA standard, header names cannot include tab symbols. Later, in annotation files such as GFF3, the tab symbol is used to separate columns; therefore it is necessary to make sure the genome sequence names (for example of different chromosomes) do not contain tabs. This is why they are renamed. However, there is a file in the project documenting the renaming: https://github.com/DerKevinRiehl/transposon_annotation_reasonaTE/blob/main/workspace/testProject/sequence_heads.txt So if you want the original names (which cannot contain tabs or spaces), you can rename them back based on this sequence_heads.txt.
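Renaming back amounts to replacing the first column of each GFF3 feature line using the mapping file. The sketch below illustrates this under a loudly stated assumption: it treats the mapping as one `renamed<TAB>original` pair per line, which may not match the real layout of sequence_heads.txt. The sample names `seq1` and `chrI` and both function names are hypothetical.

```python
def load_mapping(lines):
    """Read 'renamed<TAB>original' pairs (ASSUMED format) into a dict."""
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        renamed, original = line.split("\t", 1)
        mapping[renamed] = original
    return mapping


def restore_seqids(gff_lines, mapping):
    """Replace column 1 (seqid) of GFF3 feature lines via the mapping.

    Comment and directive lines (starting with '#') and blank lines are
    returned unchanged; unknown sequence names are left as they are.
    """
    restored = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            restored.append(line)
            continue
        columns = line.split("\t")
        columns[0] = mapping.get(columns[0], columns[0])
        restored.append("\t".join(columns))
    return restored
```

Only the first column is touched, so coordinates, strand and attributes of every feature stay exactly as the tools wrote them.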
after running independently RepeatMasker, I manage to finish to run reasonaTE, but I can find gff3 in transposonCandA folder which do not have renamed sequence name. Where could I find a final gff file with correct sequence name ?
So if you run all steps of reasonaTE (did you do all the steps?) then the final result will be stored to the folder "finalResults" and not "transposonCandA". https://github.com/DerKevinRiehl/transposon_annotation_reasonaTE/tree/main/workspace/testProject/finalResults Besides, you can find the generated statistics in the main workspace folder: https://github.com/DerKevinRiehl/transposon_annotation_reasonaTE/blob/main/workspace/testProject/Statistics_FinalAnnotations.txt and https://github.com/DerKevinRiehl/transposon_annotation_reasonaTE/blob/main/workspace/testProject/Statistics_ToolAnnotations.txt
I hope my answers could help somehow. Please feel free to answer and raise more questions, or reach out for more suggestions.
Thanks, Best regards, Kevin
Personally I have not had trouble with RepeatModeler installed from conda, using a conda environment created from transposon_annotation_tools_env.yml. I have created two projects to test reasonaTE with the demo data. First:
reasonaTE -mode createProject -projectFolder workspace -projectName testProject2 -inputFasta sequence.fasta
reasonaTE -mode annotate -projectFolder workspace -projectName testProject2 -tool all
reasonaTE -mode checkAnnotations -projectFolder workspace -projectName testProject2
Checking helitronScanner ... completed
Checking ltrHarvest ... completed
Checking ltrPred ... not completed
Checking mitefind ... completed
Checking mitetracker ... completed
Checking must ... completed
Checking repeatmodel ... completed
Checking repMasker ... not completed
Checking sinefind ... completed
Checking sinescan ... completed
Checking tirvish ... completed
Checking transposonPSI ... completed
Checking NCBICDD1000 ... completed
Second:
reasonaTE -mode createProject -projectFolder workspace -projectName testProject3 -inputFasta sequence.fasta
reasonaTE -mode annotate -projectFolder workspace -projectName testProject3 -tool helitronScanner
reasonaTE -mode annotate -projectFolder workspace -projectName testProject3 -tool ltrHarvest
reasonaTE -mode annotate -projectFolder workspace -projectName testProject3 -tool mitefind
reasonaTE -mode annotate -projectFolder workspace -projectName testProject3 -tool mitetracker
reasonaTE -mode annotate -projectFolder workspace -projectName testProject3 -tool must
reasonaTE -mode annotate -projectFolder workspace -projectName testProject3 -tool repeatmodel
reasonaTE -mode annotate -projectFolder workspace -projectName testProject3 -tool repMasker xxxxx -lib worm_repDB.fasta
reasonaTE -mode annotate -projectFolder workspace -projectName testProject3 -tool sinefind
reasonaTE -mode annotate -projectFolder workspace -projectName testProject3 -tool sinescan
reasonaTE -mode annotate -projectFolder workspace -projectName testProject3 -tool tirvish
reasonaTE -mode annotate -projectFolder workspace -projectName testProject3 -tool transposonPSI
reasonaTE -mode annotate -projectFolder workspace -projectName testProject3 -tool NCBICDD1000
reasonaTE -mode checkAnnotations -projectFolder workspace -projectName testProject3
Checking helitronScanner ... completed
Checking ltrHarvest ... completed
Checking ltrPred ... not completed
Checking mitefind ... completed
Checking mitetracker ... completed
Checking must ... completed
Checking repeatmodel ... completed
Checking repMasker ... not completed
Checking sinefind ... completed
Checking sinescan ... completed
Checking tirvish ... completed
Checking transposonPSI ... completed
Checking NCBICDD1000 ... completed
So in both cases RepeatMasker fails.
Here is the error:
reasonaTE -mode annotate -projectFolder workspace -projectName testProject3 -tool repMasker
RepeatMasker version 4.1.1
Search Engine: NCBI/RMBLAST [ 2.10.0+ ]
Using Master RepeatMasker Database: /mnt/ssd/ssd_1/conda_envs/nicolas_reasonate_all/share/RepeatMasker/Libraries/RepeatMaskerLib.h5 Title : Version : Date : Families :
Species "homo sapiens" is not known to RepeatMasker. There may not be any TE families defined in the libraries for this species/clade or there may be an error in the spelling. Please check your entry against the NCBI Taxonomy database and/or try using a broader clade or related species instead. The full list of species/clades defined in the library may be obtained using the famdb.py script.
Annotation by software repMasker finished successfully...
It fails in the same way with the command:
reasonaTE -mode annotate -projectFolder workspace -projectName testProject3 -tool repMasker xxxxx -lib worm_repDB.fasta
which should use a different database.
But if I go to the repMasker folder and run it directly (I haven't changed the conda environment):
cd workspace/testProject3/repMasker
RepeatMasker -pa 10 -lib ../../../worm_repDB.fasta sequence.fasta
RepeatMasker version 4.1.1
Search Engine: NCBI/RMBLAST [ 2.10.0+ ]
Using Custom Repeat Library: ../../../worm_repDB.fasta
Building general libraries in: /mnt/ssd/ssd_1/conda_envs/nicolas_reasonate_all/share/RepeatMasker/Libraries//general
Traceback (most recent call last):
File "/mnt/ssd/ssd_1/conda_envs/nicolas_reasonate_all/share/RepeatMasker/famdb.py", line 51, in
analyzing file sequence.fasta
identifying Simple Repeats in batch 1 of 23
identifying Simple Repeats in batch 2 of 23
identifying Simple Repeats in batch 4 of 23
identifying Simple Repeats in batch 5 of 23
identifying Simple Repeats in batch 6 of 23
identifying Simple Repeats in batch 9 of 23
....
identifying Simple Repeats in batch 21 of 23
identifying matches to worm_repDB.fasta sequences in batch 17 of 23
identifying Simple Repeats in batch 17 of 23
processing output: cycle 1 .. cycle 2 .. cycle 3 .. cycle 4 .. cycle 5 cycle 6 .. cycle 7 .. cycle 8 . cycle 9 . cycle 10 .
Generating output... . masking done
So, as it can run with the RepeatMasker installed from conda, I am convinced that the parsing with xxxxx is not working in this case.
I think it would be better if reasonaTE did not copy the sequence.fasta file into the directory where the file is needed, but rather created a symbolic link, to avoid duplicating large files.
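A minimal sketch of the symbolic-link idea, using only the Python standard library. It is an illustration of the suggestion and not how reasonaTE currently prepares its tool folders; the function name `link_sequence` is hypothetical, and it assumes each tool is able to read its input through a symlink.

```python
import os


def link_sequence(source_fasta, tool_folder, link_name="sequence.fasta"):
    """Create tool_folder/link_name as a symlink to source_fasta.

    Returns the path of the link. An existing file or link at that
    location is left untouched rather than overwritten.
    """
    os.makedirs(tool_folder, exist_ok=True)
    link_path = os.path.join(tool_folder, link_name)
    if os.path.lexists(link_path):
        return link_path
    # Use an absolute target so the link stays valid from any working directory.
    os.symlink(os.path.abspath(source_fasta), link_path)
    return link_path
```

The link occupies a few bytes regardless of genome size, so a project with many tool folders would hold the sequence only once on disk.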
A TAB or space (I think even a pipe |) in the header of a fasta file is not part of the accession name of the sequence, which should be unique. So even if you had a sequence named:
chr1 species:human chr:1:ref:xxxxx
the accession of the sequence is chr1; the rest is description, and if you look at gff files this is the name that appears in the first column.
Again, if there were no need to create a copy of the original sequence file with renamed accessions, I think it would reduce the size of the working directory.
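The convention described here (the identifier is the header text up to the first whitespace) can be sketched in a few lines. This follows the common convention used by many sequence tools and is not taken from reasonaTE; it does not split at pipes, since that part of the comment was stated with less certainty. The function name `accession` is hypothetical.

```python
def accession(header):
    """Return the sequence identifier of a FASTA header line.

    The identifier is taken as everything after the leading '>' up to
    the first space or tab; the remainder is treated as description.
    """
    text = header[1:] if header.startswith(">") else header
    parts = text.split(None, 1)
    return parts[0] if parts else ""
```

Applied to the header in the example above, this yields `chr1`, the name that would be expected in the first column of a GFF file.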
Thank you for pointing out the final results !
Regards
Nicolas
Dear Nicolas, thanks for your points and open feedback.
The behavior of RepeatMasker and RepeatModeler is reported differently by different users. I think the most important thing is that you found a way to include their outputs in reasonaTE.
I agree, the renaming of the sequence files and the copying of the files is something that should be optimized in the next update. Therefore, I included your feedback in the latest update. Please find v1.0.3 of reasonaTE and the updated tutorial. You can now rename any GFF back to the original sequence names by using Mode 8 of reasonaTE. https://github.com/DerKevinRiehl/transposon_annotation_reasonaTE
Best regards, Kevin
@blavetn
Dear Nicolas,
mustv2 issued the following errors.
Spliting the genome into 10 sub-genomes ... [SubFiles:10] [done] [Elapsed time: 0 seconds]
Scanning the nucleotide sequences for potential MITEs ...
Total calculation time elapsed: 146 seconds.
Loading the annotation data ... [RawMITEs:54584] [done] [Time:3 seconds]
Removing redundancy in the predicted MITEs ... [RuleBasedRemoving:3] [MITE:54581] [done] [Time:3 seconds]
Clustering ... [BLAST] [done] [ClusterNum:1492] [Time:378 seconds]
Deciding the strand information of each valid MITE ... [done] [Time:4 seconds]
Saving the data ... [done] [ValidMITENum:6526] [Time:12 seconds]
Screening for all the MITE copies ... [BackupMITEs:2078] [Templates:2078] [BLAT] [OtherCopy:1496] [AllCopySaved:3574]"/home/ram/transposon_annotation_tools_env/bin/MUST.r2-4-002.Release/refineBoundary.pl" "temp" 0.8 "temp/temp-genome-seq.fasta" "result.txt.be-filter1" "result.txt.be-filter2" -Error when refineBoundary!
When using the demo file, there is no error.
Please send me your suggestion.
with regards
Ramky
Dear Kevin
By trying to install the different tools separately, I have noticed that mustv2 needs blast-legacy to be installed to work properly, and sinescan needs to be installed with python 2.7.
I have installed and run the different tools independently on your demo dataset, but I haven't got any results from transposonPSI and proteinNCBICDD1000, which makes the following command fail:
reasonaTE -mode pipeline -projectFolder workspace -projectName testProject
Traceback (most recent call last): File "reasonate/share/TransposonAnnotator_reasonaTE/TransposonAnnotator.py", line 165, in
createToolAnnotation_Files(os.path.join(arg1,arg2), os.path.join(arg1,arg2,"finalResults"), os.path.join(arg1,arg2,"parsedAnnotations"), os.path.join(arg1,arg2,"transposonCandB"), os.path.join(arg1,arg2,"transposonCandF"))
File "reasonate/share/TransposonAnnotator_reasonaTE/FinalResultsCreator.py", line 60, in createToolAnnotation_Files
mergeAnnotations([os.path.join(folderParsedAnnotations,"transposonPSI.gff3"),os.path.join(folderParsedAnnotations,"NCBICDD1000.gff3")], os.path.join(folderParsedAnnotations,"proteinfeatures.gff3"))
File "reasonate/share/TransposonAnnotator_reasonaTE/GFFTools.py", line 86, in mergeAnnotations
f = open(fileIn,"r")
FileNotFoundError: [Errno 2] No such file or directory: 'workspace/testProject/parsedAnnotations/transposonPSI.gff3'
Any idea what could have caused that ?
Regards Nicolas
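The traceback shows the pipeline opening `parsedAnnotations/transposonPSI.gff3` and `parsedAnnotations/NCBICDD1000.gff3` without checking that they exist. A small pre-flight check like the sketch below can report the missing files before the pipeline mode is started. The file and folder names come from the traceback; the function itself is a hypothetical helper and not part of reasonaTE.

```python
import os

# The two files that the traceback shows being merged into proteinfeatures.gff3.
REQUIRED_PARSED_ANNOTATIONS = ("transposonPSI.gff3", "NCBICDD1000.gff3")


def missing_parsed_annotations(project_path, required=REQUIRED_PARSED_ANNOTATIONS):
    """Return the required files absent from <project_path>/parsedAnnotations."""
    folder = os.path.join(project_path, "parsedAnnotations")
    return [name for name in required
            if not os.path.isfile(os.path.join(folder, name))]


if __name__ == "__main__":
    missing = missing_parsed_annotations(os.path.join("workspace", "testProject"))
    if missing:
        print("Missing parsed annotations:", ", ".join(missing))
```

An empty result means both protein annotation files are in place, so the merge step from the traceback should at least find its inputs.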