TobyBaril / EarlGrey

Earl Grey: A fully automated TE curation and annotation pipeline

In these clusters: *_clusTErs, *_mergedRepeats, *_summaryFiles, you get the following error: #20

Closed Zoey1001 closed 2 years ago

Zoey1001 commented 2 years ago

Progress:3571426/3571437... Progress:3571427/3571437... Progress:3571428/3571437... Progress:3571429/3571437... Progress:3571430/3571437... Progress:3571431/3571437... Progress:3571432/3571437... Progress:3571433/3571437... Progress:3571434/3571437... Progress:3571435/3571437... Progress:3571436/3571437...
Step 5: Merging GFF records by labels...
Step 6: Writing stat file..
Removing tmp files...
Done
Traceback (most recent call last):
  File "/home/dell/EarlGrey/scripts/repeatCraft/repeatcraft.py", line 187, in <module>
    rcStatm.rcstat(rclabelp=outputnamelabel,rmergep=outputnamemerge,outfile= statfname, ltrgroup = True)
  File "/home/dell/EarlGrey/scripts/repeatCraft/helper/rcStatm.py", line 54, in rcstat
    if rowRaw.get(col[2]):
IndexError: list index out of range


< Resolving Overlapping Repeats >

    \   ^__^
     \  (oo)\_______
        (__)\       )\/                
            ||----w |
            ||     ||

Loading required package: stats4
Loading required package: BiocGenerics

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:stats’:

IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

anyDuplicated, append, as.data.frame, basename, cbind, colnames,
dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
union, unique, unsplit, which.max, which.min

Loading required package: S4Vectors

Attaching package: ‘S4Vectors’

The following objects are masked from ‘package:base’:

expand.grid, I, unname

Loading required package: IRanges
Loading required package: GenomeInfoDb
Warning messages:
1: package ‘GenomicRanges’ was built under R version 4.1.2
2: package ‘S4Vectors’ was built under R version 4.1.2
3: package ‘IRanges’ was built under R version 4.1.2
Warning message:
package ‘ape’ was built under R version 4.1.2
[1] "/home/dell/miniconda3/envs/earlGrey/lib/R/bin/exec/R"
[2] "--no-echo"
[3] "--no-restore"
[4] "--file=/home/dell/EarlGrey/scripts/filteringOverlappingRepeats.R"
[5] "--args"
[6] "/home/dell/1k_spider/zq/PL/repeat/P_lEarlGrey/P_l_mergedRepeats/looseMerge/P_l.rmerge.gff.sorted"
[7] "/home/dell/1k_spider/zq/PL/repeat/P_lEarlGrey/P_l_mergedRepeats/looseMerge/P_l.rmerge.gff.filtered"
Error: package or namespace load failed for ‘tidyverse’: package ‘rlang’ was installed before R 4.0.0: please re-install it
Execution halted
cp: cannot stat '/home/dell/1k_spider/zq/PL/repeat/P_lEarlGrey/P_l_mergedRepeats/looseMerge/P_l.filteredRepeats.bed': No such file or directory
Traceback (most recent call last):
  File "/home/dell/EarlGrey/scripts/backSwap.py", line 14, in <module>
    table = pd.read_csv(input, names = ['scaf', 'start', 'end', 'repeat', 'score', 'strand'], delim_whitespace = True, header = None)
  File "/home/dell/miniconda3/envs/earlGrey/lib/python3.6/site-packages/pandas/io/parsers.py", line 688, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/dell/miniconda3/envs/earlGrey/lib/python3.6/site-packages/pandas/io/parsers.py", line 454, in _read
    parser = TextFileReader(fp_or_buf, kwds)
  File "/home/dell/miniconda3/envs/earlGrey/lib/python3.6/site-packages/pandas/io/parsers.py", line 948, in __init__
    self._make_engine(self.engine)
  File "/home/dell/miniconda3/envs/earlGrey/lib/python3.6/site-packages/pandas/io/parsers.py", line 1180, in _make_engine
    self._engine = CParserWrapper(self.f, self.options)
  File "/home/dell/miniconda3/envs/earlGrey/lib/python3.6/site-packages/pandas/io/parsers.py", line 2010, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 382, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 674, in pandas._libs.parsers.TextReader._setup_parser_source
FileNotFoundError: [Errno 2] No such file or directory: '/home/dell/1k_spider/zq/PL/repeat/P_lEarlGrey/P_l_mergedRepeats/looseMerge/P_l.filteredRepeats.bed'
mv: cannot stat '/home/dell/1k_spider/zq/PL/repeat/P_lEarlGrey/P_l_mergedRepeats/looseMerge/P_l.filteredRepeats.bed.2': No such file or directory
Error: package or namespace load failed for ‘tidyverse’: package ‘rlang’ was installed before R 4.0.0: please re-install it
Execution halted


< Done! >

    \   ^__^
     \  (oo)\_______
        (__)\       )\/                
            ||----w |
            ||     ||

< Generating Summary Plots >

    \   ^__^
     \  (oo)\_______
        (__)\       )\/                
            ||----w |
            ||     ||

Error: package or namespace load failed for ‘tidyverse’: package ‘rlang’ was installed before R 4.0.0: please re-install it
Execution halted
Error: package or namespace load failed for ‘tidyverse’: package ‘rlang’ was installed before R 4.0.0: please re-install it
Execution halted


< Identifying TE Clusters and Member Sequences >

    \   ^__^
     \  (oo)\_______
        (__)\       )\/                
            ||----w |
            ||     ||

Error: Unable to open file /home/dell/1k_spider/zq/PL/repeat/P_lEarlGrey/P_l_mergedRepeats/looseMerge/P_l.filteredRepeats.bed. Exiting.
Error: The requested file (/home/dell/1k_spider/zq/PL/repeat/P_lEarlGrey/P_l_mergedRepeats/looseMerge/P_l.filteredRepeats.bed) could not be opened. Error message: (No such file or directory). Exiting!


< Tidying Directories and Organising Important Files >

    \   ^__^
     \  (oo)\_______
        (__)\       )\/                
            ||----w |
            ||     ||

cp: cannot stat '/home/dell/1k_spider/zq/PL/repeat/P_lEarlGrey/P_l_mergedRepeats/looseMerge/P_l.filteredRepeats.bed': No such file or directory
cp: cannot stat '/home/dell/1k_spider/zq/PL/repeat/P_lEarlGrey/P_l_mergedRepeats/looseMerge/P_l.filteredRepeats.gff': No such file or directory


< Done in 01:05:05.00 >

    \   ^__^
     \  (oo)\_______
        (__)\       )\/                
            ||----w |
            ||     ||

< De Novo Library, Combined Library, Summary Figures, and TE Quantifications in Standard Formats Can Be Found in /home/dell/1k_spider/zq/PL/repeat/P_lEarlGrey/P_l_summaryFiles/ >

    \   ^__^
     \  (oo)\_______
        (__)\       )\/                
            ||----w |
            ||     ||

Could you please tell me how to solve it? Thank you!

TobyBaril commented 2 years ago

Hi,

The first error is something that gets raised in RepeatCraft, but has no effect (confirmed by the author of RepeatCraft), so can be ignored.
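
As an illustration of how an IndexError of this kind arises, here is a minimal Python sketch. It is not RepeatCraft's actual code: the function names are hypothetical, and it simply assumes that a line is split into fields and the third one (`col[2]`) is read, which fails whenever a line has fewer than three fields.

```python
def unguarded_third_field(line, sep="\t"):
    """Hypothetical helper: read the third field without checking the length."""
    col = line.rstrip("\n").split(sep)
    return col[2]  # raises IndexError when the line has fewer than 3 fields


def third_field(line, sep="\t"):
    """Hypothetical guarded variant: return None for short lines instead of failing."""
    col = line.rstrip("\n").split(sep)
    if len(col) < 3:
        return None
    return col[2]


if __name__ == "__main__":
    print(third_field("scaffold_1\tRepeatMasker\tLTR\n"))
    print(third_field("\n"))  # short or empty line: no exception, just None
```

A trailing empty line or a header line in the input would be enough to trigger the unguarded version, which is one plausible (but unconfirmed) explanation for the traceback appearing only after "Done".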

The following ones look to be due to the version of R you have installed being old: package ‘rlang’ was installed before R 4.0.0: please re-install it.

Try installing the latest version of R when the earlGrey environment is active:

conda install -c conda-forge r-base
conda install -c conda-forge r-rlang

and this should hopefully fix the issue!
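
If the log is long, it can help to pull out every R package that reports this problem before reinstalling. The following is a small sketch, assuming the log has been saved as text and that the message wording matches the one quoted in this thread; the function name `stale_r_packages` is made up for this example.

```python
import re

# Matches e.g.: package ‘rlang’ was installed before R 4.0.0
# Accepts both curly and straight single quotes around the package name.
STALE_PACKAGE = re.compile(
    r"package [‘']([^’']+)[’'] was installed before R (\d+(?:\.\d+)*)"
)


def stale_r_packages(log_text):
    """Return the sorted, de-duplicated names of R packages flagged as stale."""
    return sorted({match.group(1) for match in STALE_PACKAGE.finditer(log_text)})


if __name__ == "__main__":
    example = (
        "Error: package or namespace load failed for ‘tidyverse’: "
        "package ‘rlang’ was installed before R 4.0.0: please re-install it"
    )
    for name in stale_r_packages(example):
        print(name)
```

Each reported name is a candidate for reinstalling inside the active earlGrey environment, in the same way as shown above for rlang.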

Zoey1001 commented 2 years ago

OK, thank you for your reply. I have fixed the problem by reinstalling R, and I got these results:

Step 6: Writing stat file..
Removing tmp files...
Done
Traceback (most recent call last):
  File "/home/dell/EarlGrey/scripts/repeatCraft/repeatcraft.py", line 187, in <module>
    rcStatm.rcstat(rclabelp=outputnamelabel,rmergep=outputnamemerge,outfile= statfname, ltrgroup = True)
  File "/home/dell/EarlGrey/scripts/repeatCraft/helper/rcStatm.py", line 54, in rcstat
    if rowRaw.get(col[2]):
IndexError: list index out of range


< Resolving Overlapping Repeats >

    \   ^__^
     \  (oo)\_______
        (__)\       )\/                
            ||----w |
            ||     ||

Loading required package: stats4
Loading required package: BiocGenerics

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:stats’:

IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

anyDuplicated, append, as.data.frame, basename, cbind, colnames,
dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
union, unique, unsplit, which.max, which.min

Loading required package: S4Vectors

Attaching package: ‘S4Vectors’

The following objects are masked from ‘package:base’:

expand.grid, I, unname

Loading required package: IRanges
Loading required package: GenomeInfoDb
Warning messages:
1: package ‘GenomicRanges’ was built under R version 4.1.2
2: package ‘S4Vectors’ was built under R version 4.1.2
3: package ‘IRanges’ was built under R version 4.1.2
Warning message:
package ‘ape’ was built under R version 4.1.2
[1] "/home/dell/miniconda3/envs/earlGrey/lib/R/bin/exec/R"
[2] "--no-echo"
[3] "--no-restore"
[4] "--file=/home/dell/EarlGrey/scripts/filteringOverlappingRepeats.R"
[5] "--args"
[6] "/home/dell/1k_spider/zq/PL/repeat/P_lEarlGrey/P_l_mergedRepeats/looseMerge/P_l.rmerge.gff.sorted"
[7] "/home/dell/1k_spider/zq/PL/repeat/P_lEarlGrey/P_l_mergedRepeats/looseMerge/P_l.rmerge.gff.filtered"
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
✔ ggplot2 3.3.5     ✔ purrr   0.3.4
✔ tibble  3.1.5     ✔ dplyr   1.0.7
✔ tidyr   1.1.4     ✔ stringr 1.4.0
✔ readr   2.0.2     ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

You have loaded plyr after dplyr - this is likely to cause problems. If you need functions from both plyr and dplyr, please load plyr first, then dplyr: library(plyr); library(dplyr)

Attaching package: ‘plyr’

The following objects are masked from ‘package:dplyr’:

arrange, count, desc, failwith, id, mutate, rename, summarise,
summarize

The following object is masked from ‘package:purrr’:

compact

Attaching package: ‘magrittr’

The following object is masked from ‘package:purrr’:

set_names

The following object is masked from ‘package:tidyr’:

extract

Attaching package: ‘data.table’

The following objects are masked from ‘package:dplyr’:

between, first, last

The following object is masked from ‘package:purrr’:

transpose

[1] "/home/dell/miniconda3/envs/earlGrey/lib/R/bin/exec/R"
[2] "--no-echo"
[3] "--no-restore"
[4] "--file=/home/dell/EarlGrey/scripts/mergeRepeats.R"
[5] "--args"
[6] "/home/dell/1k_spider/zq/PL/repeat/P_lEarlGrey/P_l_mergedRepeats/looseMerge/P_l.rmerge.gff.filtered"
[7] "/home/dell/1k_spider/zq/PL/repeat/P_lEarlGrey/P_l_mergedRepeats/looseMerge/P_l.mergedRepeats.bed"
[8] "2592532507"
[9] "/home/dell/1k_spider/zq/PL/repeat/P_lEarlGrey/P_l_mergedRepeats/looseMerge/P_l.mergedRepeats.revisedTable"
[10] "/home/dell/1k_spider/zq/PL/repeat/P_lEarlGrey/P_l_mergedRepeats/looseMerge/P_l.filteredRepeats.bed"
[11] "/home/dell/1k_spider/zq/PL/repeat/P_lEarlGrey/P_l_mergedRepeats/looseMerge/P_l.filteredRepeats.summary"
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
✔ ggplot2 3.3.5     ✔ purrr   0.3.4
✔ tibble  3.1.5     ✔ dplyr   1.0.7
✔ tidyr   1.1.4     ✔ stringr 1.4.0
✔ readr   2.0.2     ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
[1] "/home/dell/miniconda3/envs/earlGrey/lib/R/bin/exec/R"
[2] "--no-echo"
[3] "--no-restore"
[4] "--file=/home/dell/EarlGrey/scripts/makeGff.R"
[5] "--args"
[6] "/home/dell/1k_spider/zq/PL/repeat/P_lEarlGrey/P_l_mergedRepeats/looseMerge/P_l.filteredRepeats.bed"
[7] "/home/dell/1k_spider/zq/PL/repeat/P_lEarlGrey/P_l_mergedRepeats/looseMerge/P_l.rmerge.gff.filtered"
[8] "/home/dell/1k_spider/zq/PL/repeat/P_lEarlGrey/P_l_mergedRepeats/looseMerge/P_l.filteredRepeats.gff"


< Done! >

    \   ^__^
     \  (oo)\_______
        (__)\       )\/                
            ||----w |
            ||     ||

< Generating Summary Plots >

    \   ^__^
     \  (oo)\_______
        (__)\       )\/                
            ||----w |
            ||     ||

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
✔ ggplot2 3.3.5     ✔ purrr   0.3.4
✔ tibble  3.1.5     ✔ dplyr   1.0.7
✔ tidyr   1.1.4     ✔ stringr 1.4.0
✔ readr   2.0.2     ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Attaching package: ‘data.table’

The following objects are masked from ‘package:dplyr’:

between, first, last

The following object is masked from ‘package:purrr’:

transpose

[1] "/home/dell/miniconda3/envs/earlGrey/lib/R/bin/exec/R"
[2] "--no-echo"
[3] "--no-restore"
[4] "--file=/home/dell/EarlGrey/scripts/autoPie.R"
[5] "--args"
[6] "/home/dell/1k_spider/zq/PL/repeat/P_lEarlGrey/P_l_mergedRepeats/looseMerge/P_l.filteredRepeats.bed"
[7] "/home/dell/1k_spider/zq/PL/repeat/P_lEarlGrey/P_l_mergedRepeats/looseMerge/P_l.filteredRepeats.gff"
[8] "2592532507"
[9] "/home/dell/1k_spider/zq/PL/repeat/P_lEarlGrey/P_l_summaryFiles/P_l.summaryPie.pdf"
[10] "/home/dell/1k_spider/zq/PL/repeat/P_lEarlGrey/P_l_summaryFiles/P_l.highLevelCount.txt"
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
✔ ggplot2 3.3.5     ✔ purrr   0.3.4
✔ tibble  3.1.5     ✔ dplyr   1.0.7
✔ tidyr   1.1.4     ✔ stringr 1.4.0
✔ readr   2.0.2     ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Attaching package: ‘data.table’

The following objects are masked from ‘package:dplyr’:

between, first, last

The following object is masked from ‘package:purrr’:

transpose

Attaching package: ‘magrittr’

The following object is masked from ‘package:purrr’:

set_names

The following object is masked from ‘package:tidyr’:

extract

[1] "/home/dell/miniconda3/envs/earlGrey/lib/R/bin/exec/R"
[2] "--no-echo"
[3] "--no-restore"
[4] "--file=/home/dell/EarlGrey/scripts/autoLand.R"
[5] "--args"
[6] "/home/dell/1k_spider/zq/PL/repeat/P_lEarlGrey/P_l_RepeatLandscape/P_l.divsum"
[7] "2592532507"
[8] "P_l"
[9] "/home/dell/1k_spider/zq/PL/repeat/P_lEarlGrey/P_l_summaryFiles/P_l.repeatLandscape.pdf"
summarise() has grouped output by 'species', 'classif'. You can override using the .groups argument.
Completed: 100
Completed: 200
Completed: 300
Completed: 400
Completed: 500
Completed: 600
summarise() has grouped output by 'species', 'Divergence'. You can override using the .groups argument.


< Identifying TE Clusters and Member Sequences >

    \   ^__^
     \  (oo)\_______
        (__)\       )\/                
            ||----w |
            ||     ||

< Tidying Directories and Organising Important Files >

    \   ^__^
     \  (oo)\_______
        (__)\       )\/                
            ||----w |
            ||     ||

< Done in 01:14:53.00 >

    \   ^__^
     \  (oo)\_______
        (__)\       )\/                
            ||----w |
            ||     ||

< De Novo Library, Combined Library, Summary Figures, and TE Quantifications in Standard Formats Can Be Found in /home/dell/1k_spider/zq/PL/repeat/P_lEarlGrey/P_l_summaryFiles/ >

    \   ^__^
     \  (oo)\_______
        (__)\       )\/                
            ||----w |
            ||     ||

But when I checked the results, I found that there are a lot of "Unclassified" repeats. Could you please tell me how to fix this?

tclassif                                    cov        count    proportion            gen         Number_of_Distinct_Classifications
DNA                                         381289164  646989   0.147072086066615     2592532507  7536
LINE                                        114698753  169894   0.0442419729319907    2592532507  6933
LTR                                         104741925  89611    0.0404013931232068    2592532507  6829
Other (Simple Repeat, Microsatellite, RNA)  1386767    1317     0.000534908239821735  2592532507  396
Penelope                                    4598716    7150     0.0017738315672352    2592532507  1461
Rolling Circle                              26069630   40011    0.0100556617630099    2592532507  3091
SINE                                        4759158    14464    0.00183571777293051   2592532507  1263
Unclassified                                618695287  1101120  0.238645141509117     2592532507  8088

TobyBaril commented 2 years ago

There isn't necessarily a way to "fix" unclassified TEs, as that is exactly what they are - they currently cannot be classified by automated methods.

Depending on which TE libraries you specify with the -r flag, there may be many unclassified TE families. The TE landscape of a single species can be very different from that of other species, even closely related ones, and a lack of sampling for closely related species can mean that many TEs cannot be classified, even though they are identified as TEs by de novo methods such as RepeatModeler, which is employed here.

A good example of this is in the Earl Grey preprint. When we only used RepeatMasker with a library of known TEs for Coleoptera, M. bipustulatus appeared to have a TE content of ~3%, consisting mostly of known repeats. However, when we used de novo methods, we estimated TE content to be between 31.86% and 56.57%, of which many are unclassified.

Overall, depending on the species you are looking at, and the availability of sequences for related species, you may find a high number of unclassified TEs. These still have the potential to be real TEs that we don't currently have an understanding of, or they are not similar enough to known TEs to be classified. Some of these TE families can probably be classified using manual annotation, and as sampling of TEs in more diverse genomes increases, the level of unclassified families should decrease.
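
To put the numbers from the summary table posted earlier in this thread into context, here is a short Python sketch. The coverage values and genome size are copied from that table; the function name is made up for this example, and summing the `cov` column assumes the categories do not overlap, which seems reasonable after the overlap-resolution step but is an assumption.

```python
# 'gen' column from the posted table (genome size in bp).
GENOME_SIZE = 2592532507

# 'cov' column from the posted table (bp covered per classification).
COVERAGE = {
    "DNA": 381289164,
    "LINE": 114698753,
    "LTR": 104741925,
    "Other (Simple Repeat, Microsatellite, RNA)": 1386767,
    "Penelope": 4598716,
    "Rolling Circle": 26069630,
    "SINE": 4759158,
    "Unclassified": 618695287,
}


def summarise_unclassified(coverage, genome_size):
    """Return repeat and Unclassified coverage as fractions (assumes no overlap)."""
    total = sum(coverage.values())
    unclassified = coverage["Unclassified"]
    return {
        "repeats_of_genome": total / genome_size,
        "unclassified_of_genome": unclassified / genome_size,
        "unclassified_of_repeats": unclassified / total,
    }


if __name__ == "__main__":
    for name, value in summarise_unclassified(COVERAGE, GENOME_SIZE).items():
        print(f"{name}: {value:.2%}")
```

For this genome, roughly 48% of the assembly is annotated as repeats and about half of that coverage is Unclassified, which is the kind of result one might expect for a lineage with few curated TE libraries.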

Zoey1001 commented 2 years ago

Okay, I got it. Thank you!