GonzalezLab / T-lex3

GNU General Public License v3.0

check pre-requisites #4

Open AnneJRomero opened 1 year ago

AnneJRomero commented 1 year ago

Is there a quick way to check that all the prerequisites were installed correctly and that tlex3 will run properly?

m-bogaerts commented 1 year ago

Good afternoon,

You can find an example folder with toy files to check if everything is properly installed.

Thank you!
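A complementary check to the example run, sketched here under assumptions: before launching anything, confirm that the external programs can be found on PATH. The function name `check_tools` and the tool list are illustrative only. Of the names below, only perl, RepeatMasker and blat appear in this thread, so the full list of prerequisites should be taken from the T-lex3 manual.

```shell
# Report which of the given programs cannot be found on PATH.
# Prints one "MISSING: <name>" line per absent tool and returns 1 if any is absent.
check_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "MISSING: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Hypothetical tool list; extend it with the prerequisites named in the manual.
check_tools perl RepeatMasker blat || echo "Install the tools listed above or add them to PATH."
```

A clean result only shows that the programs can be found, not that they are configured correctly, so the example folder remains the real test.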

AnneJRomero commented 1 year ago

Hi,

Thank you for your reply.

I get the following output. Can you please help me work out what is wrong?

Job Output Follows ...

"my" variable @ref masks earlier declaration in same scope at /scratch/ar14g12/PhD/tomato/TE/Tlex3/T-lex3/tlex-open-v3.0.pl line 2088.
"my" variable $data masks earlier declaration in same scope at /scratch/ar14g12/PhD/tomato/TE/Tlex3/T-lex3/tlex-open-v3.0.pl line 2089.
"my" variable $wd masks earlier declaration in same scope at /scratch/ar14g12/PhD/tomato/TE/Tlex3/T-lex3/tlex-open-v3.0.pl line 2102.


                                                     * T-lex release 3
                            Report the presence/absence of given sequence(s) in strain(s) *
                                             and return their frequency
                                                * Wed Sep 14 20:08:07 2022 *

Simplify the fasta file of the reference sequences ...

                             ******************** Tjunction analysis ********************

Identification of TE insertions nested or flanked by repeats....
RepeatMasker version 4.1.3
Search Engine: NCBI/RMBLAST [ 2.11.0+ ]

Using Master RepeatMasker Database: /mainfs/scratch/ar14g12/PhD/tomato/TE/Tlex3/T-lex3/RepeatMasker/Libraries/RepeatMaskerLib.h5
Title : Dfam
Version : 3.6
Date : 2022-04-12
Families : 19,025

Species "drosophila" is not known to RepeatMasker. There may not be any TE families defined in the libraries for this species/clade or there may be an error in the spelling. Please check your entry against the NCBI Taxonomy database and/or try using a broader clade or related species instead. The full list of species/clades defined in the library may be obtained using the famdb.py script.

mv: cannot stat 'tlex_exampledata/Tflank_checking_125.fasta.out': No such file or directory
mv: cannot stat 'tlex_exampledata/Tflank_checking_125.fasta.masked': No such file or directory

Identification of TE insertions misannotated because of a longer Poly A/T tail....

Identification of TE insertions part of segmental duplications....
/mainfs/scratch/ar14g12/PhD/tomato/TE/Tlex3/T-lex3/RepeatMasker
blat tlex_exampledata/Tflank_checking_125.fasta tlex_exampledata/Tgenome.fasta tlex_exampledata/Tflank_checking_125.fasta.blast9 -out=blast9
Loaded 1260 letters in 10 sequences
Query sequence 2L has size 23513712, it might take a while.
Query sequence 2R has size 25286936, it might take a while.
Query sequence 3L has size 28110227, it might take a while.
Query sequence 3R has size 32079331, it might take a while.
Query sequence X has size 23542271, it might take a while.
Searched 137547960 bases in 7 sequences

No such file or directory at /scratch/ar14g12/PhD/tomato/TE/Tlex3/T-lex3/tlex-open-v3.0.pl line 346.
*FILTER TEs starts at Wed Sep 14 20:08:23 2022***

m-bogaerts commented 1 year ago

Hello Anne,

Could you send a screenshot of the Tanalysis folder so I can see what's inside? As I understand it, the program stopped and you got no results. Is that right?

Regards.
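A plain-text listing is often easier to share in an issue than a screenshot. The sketch below prints everything up to two levels below a results folder in sorted order; the function name `list_results` is illustrative, and the folder name in the example is simply the one from this run.

```shell
# Print every file and folder under a results directory, at most two levels
# deep, as sorted plain text that can be pasted into a reply.
list_results() {
    find "$1" -maxdepth 2 | LC_ALL=C sort
}

# Example: list_results tlex_exampledata
```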

AnneJRomero commented 1 year ago

Hi,

Here's the results folder:
[ar14g12@cyan52 tlex_exampledata]$ ls
Tanalysis Tflank_checking_125.fasta Tflank_checking_125.map Tgenome.fasta Tparam Tpoly_125.fasta Tpoly_125.map

Here's the Tanalysis folder:
[ar14g12@cyan52 Tanalysis]$ ls
Tflank_checking_125.fasta.blast9 Tflank_checking_125.fasta.blast9_sd Tpoly_125.fasta Tpoly_125.fasta.polyAT Tpoly_125.map

Thank you, Anne

m-bogaerts commented 1 year ago

Hello Anne,

The project name (-O) cannot contain the character "_". Instead of calling it tlex_exampledata, could you try a name without an underscore, for example tlexexampledata?

Thank you, and sorry for the inconvenience.

AnneJRomero commented 1 year ago

Hi,

Thanks for the reply, but I don't think that's the problem.

This is my script:
/local/software/perl/5.26.1/bin/perl $tlex3/tlex-open-v3.0.pl -O exampledata -T $data/TElist_example.txt -M $data/TEannotation_example.txt -G $data/genome_example.fa -R $data/fastq_files/example/example_1.fastq $data/fastq_files/example/example_2.fastq

I use -O exampledata but the output folder comes out as tlex_exampledata.
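Two naming behaviours come up in this exchange: the maintainer's advice that the -O value must not contain "_", and the observation that the output folder is the -O value prefixed with `tlex_`. The sketch below encodes both. The function names are hypothetical, and the prefix rule is inferred only from the output shown in this thread, not from the T-lex3 source.

```shell
# Reject empty project names and names containing "_" (per the maintainer's advice).
valid_project_name() {
    case "$1" in
        ""|*_*) return 1 ;;
        *)      return 0 ;;
    esac
}

# Output folder as observed in this thread: "tlex_" followed by the -O value.
expected_output_dir() {
    printf 'tlex_%s\n' "$1"
}

name="exampledata"
if valid_project_name "$name"; then
    echo "output folder should be: $(expected_output_dir "$name")"
else
    echo "project name '$name' is empty or contains '_'" >&2
fi
```

With `name="exampledata"` the check passes and the expected folder is tlex_exampledata, which matches what the run produced, so the underscore in the folder name comes from the prefix rather than from the -O value.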

m-bogaerts commented 1 year ago

Hello,

This software is quite sensitive to small details (sorry about this), which makes it difficult to understand what has gone wrong. Could you use "example" as the project name (-O)? I am fairly sure the problem is related to file names; the rest of the command looks fine.

Let me know how it goes, and sorry for the inconvenience!

AnneJRomero commented 1 year ago

Hi,

Thank you for your response.

I've tried the project name (-O) "example" but it is still giving me the same output.

Job Output Follows ...

"my" variable @ref masks earlier declaration in same scope at /scratch/ar14g12/PhD/tomato/TE/Tlex3/T-lex3/tlex-open-v3.0.pl line 2088.
"my" variable $data masks earlier declaration in same scope at /scratch/ar14g12/PhD/tomato/TE/Tlex3/T-lex3/tlex-open-v3.0.pl line 2089.
"my" variable $wd masks earlier declaration in same scope at /scratch/ar14g12/PhD/tomato/TE/Tlex3/T-lex3/tlex-open-v3.0.pl line 2102.
mkdir: cannot create directory 'tlex_example': File exists


                                                     * T-lex release 3
                            Report the presence/absence of given sequence(s) in strain(s) *
                                             and return their frequency
                                                * Wed Sep 21 20:28:22 2022 *

Simplify the fasta file of the reference sequences ...

                             ******************** Tjunction analysis ********************

mkdir: cannot create directory 'tlex_example/Tanalysis': File exists
Identification of TE insertions nested or flanked by repeats....
RepeatMasker version 4.1.3
Search Engine: NCBI/RMBLAST [ 2.11.0+ ]

Using Master RepeatMasker Database: /mainfs/scratch/ar14g12/PhD/tomato/TE/Tlex3/T-lex3/RepeatMasker/Libraries/RepeatMaskerLib.h5
Title : Dfam
Version : 3.6
Date : 2022-04-12
Families : 19,025

Species "drosophila" is not known to RepeatMasker. There may not be any TE families defined in the libraries for this species/clade or there may be an error in the spelling. Please check your entry against the NCBI Taxonomy database and/or try using a broader clade or related species instead. The full list of species/clades defined in the library may be obtained using the famdb.py script.

mv: cannot stat 'tlex_example/Tflank_checking_125.fasta.out': No such file or directory
mv: cannot stat 'tlex_example/Tflank_checking_125.fasta.masked': No such file or directory

Identification of TE insertions misannotated because of a longer Poly A/T tail....

Identification of TE insertions part of segmental duplications....
/mainfs/scratch/ar14g12/PhD/tomato/TE/Tlex3/T-lex3/RepeatMasker
blat tlex_example/Tflank_checking_125.fasta tlex_example/Tgenome.fasta tlex_example/Tflank_checking_125.fasta.blast9 -out=blast9
Loaded 1260 letters in 10 sequences
Query sequence 2L has size 23513712, it might take a while.
Query sequence 2R has size 25286936, it might take a while.
Query sequence 3L has size 28110227, it might take a while.
Query sequence 3R has size 32079331, it might take a while.
Query sequence X has size 23542271, it might take a while.
Searched 137547960 bases in 7 sequences

No such file or directory at /scratch/ar14g12/PhD/tomato/TE/Tlex3/T-lex3/tlex-open-v3.0.pl line 346.
*FILTER TEs starts at Wed Sep 21 20:28:43 2022***

m-bogaerts commented 1 year ago

Hello,

I am afraid there is a problem with RepeatMasker here: Species "drosophila" is not known to RepeatMasker.

Are you using a different library with RepeatMasker, or the default library it was installed with? RepeatMasker should accept "drosophila" as a species. I suspect it stops when it tries to mask the genome, so the files it should produce do not exist afterwards.

Let me know whether fixing this RepeatMasker issue solves the problem.

Regards!
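The reasoning above can be checked against a saved copy of the job output: if the unknown-species message is present, the `.out` and `.masked` files that the later `mv` commands look for were probably never written. This is a minimal sketch; the function names and the log file name `job.log` are hypothetical.

```shell
# Return 0 if the saved job output contains RepeatMasker's unknown-species message.
species_unknown() {
    grep -q 'is not known to RepeatMasker' "$1"
}

# Print the RepeatMasker outputs (.out, .masked) that are absent for a given FASTA.
missing_rm_outputs() {
    local fasta=$1 ext
    for ext in out masked; do
        [ -e "$fasta.$ext" ] || echo "$fasta.$ext"
    done
}

# Example (job.log is a hypothetical file holding the job output):
#   species_unknown job.log && echo "RepeatMasker rejected the species name"
#   missing_rm_outputs tlex_example/Tflank_checking_125.fasta
```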

AnneJRomero commented 1 year ago

Hi,

I installed RepeatMasker-4.1.3 and RepBase27.02. I didn't have any errors downloading these so RepeatMasker should work.

Output from the RepeatMasker installation:

Building FASTA version of RepeatMasker.lib ..............................
Building RMBlast frozen libraries..
The program is installed with a the following repeat libraries:
File: /mainfs/scratch/ar14g12/PhD/tomato/TE/Tlex3/T-lex3/RepeatMasker/Libraries/Dfam.h5
Database: Dfam
Version: 3.6
Date: 2022-04-12

Dfam - A database of transposable element (TE) sequence alignments and HMMs.

Total consensus sequences: 19025
Total HMMs: 18987

Thanks

m-bogaerts commented 1 year ago

Hello,

Sorry for my late response.

I think this could be caused by a change in the newest version of RepeatMasker: https://github.com/rmhubley/RepeatMasker/issues/123

Apparently you now need to specify not only the taxon but also the species. This should also be documented in the manual.

AnneJRomero commented 1 year ago

Hi,

Thank you for the response.

I'm not really sure how to fix this if I'm using RepeatMasker through the T-lex3 pipeline. Do I need to re-install RepeatMasker?

Thanks

m-bogaerts commented 1 year ago

Hello,

Sorry for the late response. Given this new RepeatMasker issue, I think you should pass the species with the -s argument, using "drosophila_flies_genus".

Based on your command line, that would be:
/local/software/perl/5.26.1/bin/perl $tlex3/tlex-open-v3.0.pl -O exampledata -s 'drosophila_flies_genus' -T $data/TElist_example.txt -M $data/TEannotation_example.txt -G $data/genome_example.fa -R $data/fastq_files/example/example_1.fastq $data/fastq_files/example/example_2.fastq

Hope it helps!

AnneJRomero commented 1 year ago

Hi,

Thank you for your help!

This seems to work. I got the following output, but the Tresults file is blank:

"my" variable @ref masks earlier declaration in same scope at /scratch/ar14g12/PhD/tomato/TE/Tlex3/T-lex3/tlex-open-v3.0.pl line 2088.
"my" variable $data masks earlier declaration in same scope at /scratch/ar14g12/PhD/tomato/TE/Tlex3/T-lex3/tlex-open-v3.0.pl line 2089.
"my" variable $wd masks earlier declaration in same scope at /scratch/ar14g12/PhD/tomato/TE/Tlex3/T-lex3/tlex-open-v3.0.pl line 2102.


                                                     * T-lex release 3
                            Report the presence/absence of given sequence(s) in strain(s) *
                                             and return their frequency
                                                * Mon Oct 10 17:50:54 2022 *

Simplify the fasta file of the reference sequences ...

                             ******************** Tjunction analysis ********************

Identification of TE insertions nested or flanked by repeats....
RepeatMasker version 4.1.3
Search Engine: NCBI/RMBLAST [ 2.11.0+ ]

Using Master RepeatMasker Database: /mainfs/scratch/ar14g12/PhD/tomato/TE/Tlex3/T-lex3/RepeatMasker/Libraries/RepeatMaskerLib.h5
Title : Dfam
Version : 3.6
Date : 2022-04-12
Families : 19,025

Species/Taxa Search:
Drosophila <flies,genus> [NCBI Taxonomy ID: 7215]
Lineage: root;cellular organisms;Eukaryota;Opisthokonta;Metazoa;
Eumetazoa;Bilateria;Protostomia;Ecdysozoa;Panarthropoda;
Arthropoda;Mandibulata;Pancrustacea;Hexapoda;Insecta;
Dicondylia;Pterygota ;Neoptera;Endopterygota;
Diptera;Brachycera;Muscomorpha;Eremoneura;Cyclorrhapha;
Schizophora;Acalyptratae;Ephydroidea;Drosophilidae
16 families in ancestor taxa; 206 lineage-specific families

Building species libraries in: /mainfs/scratch/ar14g12/PhD/tomato/TE/Tlex3/T-lex3/RepeatMasker/Libraries/CONS-Dfam_3.6/drosophila_flies_genus
Traceback (most recent call last):
  File "/mainfs/scratch/ar14g12/PhD/tomato/TE/Tlex3/T-lex3/RepeatMasker/famdb.py", line 1841, in <module>
    main()
  File "/mainfs/scratch/ar14g12/PhD/tomato/TE/Tlex3/T-lex3/RepeatMasker/famdb.py", line 1834, in main
    args.func(args)
  File "/mainfs/scratch/ar14g12/PhD/tomato/TE/Tlex3/T-lex3/RepeatMasker/famdb.py", line 1623, in command_families
    print_families(args, families, True, target_id)
  File "/mainfs/scratch/ar14g12/PhD/tomato/TE/Tlex3/T-lex3/RepeatMasker/famdb.py", line 1584, in print_families
    print(entry, end="")
UnicodeEncodeError: 'ascii' codec can't encode character '\xf6' in position 673: ordinal not in range(128)

analyzing file tlex_example/Tflank_checking_125.fasta

Checking for E. coli insertion elements
identifying Simple Repeats in batch 1 of 1
identifying matches to drosophila_flies_genus sequences in batch 1 of 1
identifying Simple Repeats in batch 1 of 1
processing output:
cycle 1 cycle 2 cycle 3 cycle 4 cycle 5 cycle 6 cycle 7 cycle 8 cycle 9 cycle 10
Generating output...
masking done

Identification of TE insertions misannotated because of a longer Poly A/T tail....

Identification of TE insertions part of segmental duplications....
/mainfs/scratch/ar14g12/PhD/tomato/TE/Tlex3/T-lex3/RepeatMasker
blat tlex_example/Tflank_checking_125.fasta tlex_example/Tgenome.fasta tlex_example/Tflank_checking_125.fasta.blast9 -out=blast9
Loaded 1260 letters in 10 sequences
Query sequence 2L has size 23513712, it might take a while.
Query sequence 2R has size 25286936, it might take a while.
Query sequence 3L has size 28110227, it might take a while.
Query sequence 3R has size 32079331, it might take a while.
Query sequence X has size 23542271, it might take a while.
Searched 137547960 bases in 7 sequences

                                                    *******************FILTER TEs starts at Mon Oct 10 17:51:58 2022*********************

Parameters for the detection of the PRESENCE of the given sequence(s):

Strain data: /scratch/ar14g12/PhD/tomato/TE/Tlex3/T-lex3/example/fastq_files/example/example_1.fastq
readdir() attempted on invalid dirhandle DIR4 at /scratch/ar14g12/PhD/tomato/TE/Tlex3/T-lex3/tlex-open-v3.0.pl line 1783.
cat: tlex_example/Tpresence/*/detection/results: No such file or directory
closedir() attempted on invalid dirhandle DIR4 at /scratch/ar14g12/PhD/tomato/TE/Tlex3/T-lex3/tlex-open-v3.0.pl line 1810.

                               ******************** Presence detection end at Mon Oct 10 17:51:59 2022 ************

                               ******************** Launch Absence detection start at Mon Oct 10 17:51:59 2022********************

Parameters for the detection of the ABSENCE of the given sequence(s):

TE list cleaned

mkdir Talign
PRESENCE ALIGNMENT
presence_detection directory does not exist !
/mainfs/scratch/ar14g12/PhD/tomato/TE/Tlex3/T-lex3/RepeatMasker/tlex_example
readdir() attempted on invalid dirhandle DIR at /scratch/ar14g12/PhD/tomato/TE/Tlex3/T-lex3/tlex-open-v3.0.pl line 685, line 1.
readdir() attempted on invalid dirhandle DIR at /scratch/ar14g12/PhD/tomato/TE/Tlex3/T-lex3/tlex-open-v3.0.pl line 685, line 2.
readdir() attempted on invalid dirhandle DIR at /scratch/ar14g12/PhD/tomato/TE/Tlex3/T-lex3/tlex-open-v3.0.pl line 685, line 3.
readdir() attempted on invalid dirhandle DIR at /scratch/ar14g12/PhD/tomato/TE/Tlex3/T-lex3/tlex-open-v3.0.pl line 685, line 4.
readdir() attempted on invalid dirhandle DIR at /scratch/ar14g12/PhD/tomato/TE/Tlex3/T-lex3/tlex-open-v3.0.pl line 685, line 5.
rm: cannot remove 'Talign/presence_detection/*.contig_ref': No such file or directory
closedir() attempted on invalid dirhandle DIR at /scratch/ar14g12/PhD/tomato/TE/Tlex3/T-lex3/tlex-open-v3.0.pl line 710, line 5.
TYPE: 0
Output directory: tlex_example

mkdir Talign
mkdir: cannot create directory 'Talign': File exists
ABSENCE ALIGNMENT
Talign/ directory exists !
absence_detection directory does not exist !
readdir() attempted on invalid dirhandle DIR at /scratch/ar14g12/PhD/tomato/TE/Tlex3/T-lex3/tlex-open-v3.0.pl line 736, line 5.
readdir() attempted on invalid dirhandle DIR at /scratch/ar14g12/PhD/tomato/TE/Tlex3/T-lex3/tlex-open-v3.0.pl line 752, line 5.
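The famdb.py traceback in the output above ends in `UnicodeEncodeError: 'ascii' codec can't encode character '\xf6'`. Python 3 raises this when it prints a non-ASCII character (here "ö") while its standard output encoding is ASCII, which can happen under a C or POSIX locale on a cluster node. Python reads the `PYTHONIOENCODING` environment variable at startup to override that encoding, and child processes inherit it, so setting it before launching T-lex3 may let famdb.py finish building the species library. This is a sketch that has not been tested against T-lex3; the helper name `run_with_utf8_python` is hypothetical.

```shell
# Run any command with Python's stdin/stdout/stderr encoding forced to UTF-8.
# The variable is inherited by child processes such as famdb.py.
run_with_utf8_python() {
    PYTHONIOENCODING=utf-8 "$@"
}

# Show the current locale settings; a C or POSIX locale can make older
# Python 3 versions default to ASCII output.
locale 2>/dev/null || echo "LANG=${LANG:-unset}"

# Example: prefix the usual T-lex3 command line with the helper, e.g.
#   run_with_utf8_python perl "$tlex3/tlex-open-v3.0.pl" -O example -s drosophila_flies_genus
# followed by the same -T, -M, -G and -R options as before.
```

Switching the job to a UTF-8 locale is another route to the same effect, if one is installed on the system.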

m-bogaerts commented 1 year ago

Hello,

Sorry, I was out of the office for a few days. Could you take a screenshot of the results folder (i.e. example)? It may be easier if you contact me by email: maria.bogaerts-marquez@inrae.fr. I suspect the cause is a folder name or something similar (this software is very sensitive to such things).
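One reading of the last log, offered as an inference rather than a confirmed diagnosis: Perl prints "readdir() attempted on invalid dirhandle" when the directory handle was never opened successfully, and the line just before the first such message shows "Strain data:" pointing at example_1.fastq, which is a file rather than a folder. If T-lex3 treats the -R value as a directory, that would be consistent with the missing Tpresence results. The sketch below reports what kind of path is being passed; the function name `path_kind` is hypothetical.

```shell
# Report whether a path is a directory, a regular file, or missing.
path_kind() {
    if [ -d "$1" ]; then
        echo "directory"
    elif [ -f "$1" ]; then
        echo "file"
    else
        echo "missing"
    fi
}

# Example: path_kind "$data/fastq_files/example/example_1.fastq"
```

Whether -R should name a folder is best confirmed against the T-lex3 manual before changing the command.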