arendsee / fagin

Classify genes using a syntenic filter
GNU General Public License v3.0
0 stars 0 forks source link

Object 'prot2allorf.cutoff' not found #5

Closed arendsee closed 7 years ago

arendsee commented 7 years ago

Running Fagin generates the error message:

Error in eval(expr, envir, enclos) : 
  object 'prot2allorf.cutoff' not found
Calls: knit ... in_dir -> inline_exec -> withVisible -> eval -> eval
In addition: Warning messages:
1: running command 'kpsewhich framed.sty' had status 1 
2: In test_latex_pkg("framed", system.file("misc", "framed.sty", package = "knitr")) :
  unable to find LaTeX package 'framed'; will use a copy from knitr
3: In testSI(si, tinfo, qinfo) :
  Found 51 search intervals with stop < start. Possible bug
    in Synder. You should NOT proceed.
4: In testSI(si, tinfo, qinfo) :
  66 target intervals end after expected terminus of the
    target chromosome. If this is very bad.
5: In testSI(si, tinfo, qinfo) :
  66 target intervals end after expected terminus of the
    target chromosome. If this is very bad.
6: In LoadSearchIntervals(sifile, qinfo = qinfo, tinfo = tinfo) :
  Found 2 cases where query length X search interval length
    is greater than 100000000. This will cause problems in the current implementation
    of query-vs-search interval DNA search, so these intervals will not be
    searched for dna2dna similarity.
7: In valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE) :
  GRanges object contains 66 out-of-bound ranges located on
  sequences AABY01000679.1, AABY01000781.1, AABY01000562.1,
  AABY01000227.1, AABY01000225.1, AABY01000276.1, AABY01000356.1,
  AABY01000209.1, AABY01000216.1, AABY01000323.1, AABY01000518.1,
  AABY01000436.1, AABY01000558.1, AABY01000508.1, AABY01000488.1,
  AABY01000489.1, AABY01000580.1, AABY01000551.1, AABY01000767.1,
  AABY01000513.1, AABY01000686.1, AABY01000527.1, AABY01000665.1,
  AABY01000479.1, AABY01000538.1, AABY01000688.1, AABY01000822.1,
  AABY01000517.1, AABY01000809.1, AABY01000528.1, AABY01000666.1,
  AABY01000698.1, AABY01000609.1, AABY01000796.1, AABY01000710.1,
  AABY01000505.1, AABY01000676.1, AABY01000604.1, AABY01000500.1,
  AABY01000560.1, AABY01000740.1, AABY01000318.1, AABY01000693.1,
  AABY01000235.1, AABY01000351.1, AABY01000378.1, AABY01000175.1,
  AABY01000304.1, and AABY01000313.1. Note that only ranges located
  on a non-circular sequence whose length is not NA can be
  considered out-of-bound (us [... truncated]
8: In LoadGFF(gfffile, seqinfo = tinfo) :
  18177 GFF entries are missing ids (ID=<id> labels in attribute column)

Execution halted
make[1]: *** [report.pdf] Error 1
make[1]: Leaving directory `/home/jingli/Desktop/fagin/src/report'
make: *** [run] Error 2
arendsee commented 7 years ago

Could you post the first 50 lines of so of the focal and first target GFF files?

lijing28101 commented 7 years ago

Focal: Saccharomyces_.cerevisiaegff (Other two targets from NCBI have same format)

NC_001133.9 RefSeq  telomere    1   801 .   -   .   ID=id1;Parent=-
NC_001133.9 RefSeq  origin_of_replication   707 776 .   +   .   ID=id2;Parent=-
NC_001133.9 RefSeq  gene    1807    2169    .   -   .   ID=PAU8;Parent=-
NC_001133.9 RefSeq  mRNA    1807    2169    .   -   .   ID=NM_001180043.1;Parent=PAU8
NC_001133.9 RefSeq  exon    1807    2169    .   -   .   ID=id3;Parent=NM_001180043.1
NC_001133.9 RefSeq  CDS 1807    2169    .   -   0   ID=NP_009332.1;Parent=NM_001180043.1
NC_001133.9 RefSeq  gene    2480    2707    .   +   .   ID=YAL067W-A;Parent=-
NC_001133.9 RefSeq  mRNA    2480    2707    .   +   .   ID=NM_001184582.1;Parent=YAL067W-A
NC_001133.9 RefSeq  exon    2480    2707    .   +   .   ID=id4;Parent=NM_001184582.1
NC_001133.9 RefSeq  CDS 2480    2707    .   +   0   ID=NP_878038.1;Parent=NM_001184582.1
NC_001133.9 RefSeq  gene    7235    9016    .   -   .   ID=SEO1;Parent=-
NC_001133.9 RefSeq  mRNA    7235    9016    .   -   .   ID=NM_001178208.1;Parent=SEO1
NC_001133.9 RefSeq  exon    7235    9016    .   -   .   ID=id5;Parent=NM_001178208.1
NC_001133.9 RefSeq  CDS 7235    9016    .   -   0   ID=NP_009333.1;Parent=NM_001178208.1
NC_001133.9 RefSeq  origin_of_replication   7997    8547    .   +   .   ID=id6;Parent=-
NC_001133.9 RefSeq  gene    11565   11951   .   -   .   ID=YAL065C;Parent=-
NC_001133.9 RefSeq  mRNA    11565   11951   .   -   .   ID=NM_001179897.1;Parent=YAL065C
NC_001133.9 RefSeq  exon    11565   11951   .   -   .   ID=id7;Parent=NM_001179897.1
NC_001133.9 RefSeq  CDS 11565   11951   .   -   0   ID=NP_009335.1;Parent=NM_001179897.1
NC_001133.9 RefSeq  gene    12046   12426   .   +   .   ID=YAL064W-B;Parent=-
NC_001133.9 RefSeq  mRNA    12046   12426   .   +   .   ID=NM_001180042.1;Parent=YAL064W-B
NC_001133.9 RefSeq  exon    12046   12426   .   +   .   ID=id8;Parent=NM_001180042.1
NC_001133.9 RefSeq  CDS 12046   12426   .   +   0   ID=NP_009336.1;Parent=NM_001180042.1
NC_001133.9 RefSeq  gene    13363   13743   .   -   .   ID=TDA8;Parent=-
NC_001133.9 RefSeq  mRNA    13363   13743   .   -   .   ID=NM_001180041.1;Parent=TDA8
NC_001133.9 RefSeq  exon    13363   13743   .   -   .   ID=id9;Parent=NM_001180041.1
NC_001133.9 RefSeq  CDS 13363   13743   .   -   0   ID=NP_058136.1;Parent=NM_001180041.1
NC_001133.9 RefSeq  gene    21566   21850   .   +   .   ID=YAL064W;Parent=-
NC_001133.9 RefSeq  mRNA    21566   21850   .   +   .   ID=NM_001178206.2;Parent=YAL064W
NC_001133.9 RefSeq  exon    21566   21850   .   +   .   ID=id10;Parent=NM_001178206.2
NC_001133.9 RefSeq  CDS 21566   21850   .   +   0   ID=NP_009337.2;Parent=NM_001178206.2
NC_001133.9 RefSeq  repeat_region   22230   22552   .   +   .   ID=id11;Parent=-
NC_001133.9 RefSeq  gene    22395   22685   .   -   .   ID=YAL063C-A;Parent=-
NC_001133.9 RefSeq  mRNA    22395   22685   .   -   .   ID=NM_001184642.1;Parent=YAL063C-A
NC_001133.9 RefSeq  exon    22395   22685   .   -   .   ID=id12;Parent=NM_001184642.1
NC_001133.9 RefSeq  CDS 22395   22685   .   -   0   ID=NP_878039.1;Parent=NM_001184642.1
NC_001133.9 RefSeq  gene    24000   27968   .   -   .   ID=FLO9;Parent=-
NC_001133.9 RefSeq  mRNA    24000   27968   .   -   .   ID=NM_001178205.1;Parent=FLO9
NC_001133.9 RefSeq  exon    24000   27968   .   -   .   ID=id13;Parent=NM_001178205.1
NC_001133.9 RefSeq  CDS 24000   27968   .   -   0   ID=NP_009338.1;Parent=NM_001178205.1
NC_001133.9 RefSeq  origin_of_replication   30946   31183   .   +   .   ID=id14;Parent=-
NC_001133.9 RefSeq  gene    31567   32940   .   +   .   ID=GDH3;Parent=-
NC_001133.9 RefSeq  mRNA    31567   32940   .   +   .   ID=NM_001178204.1;Parent=GDH3
NC_001133.9 RefSeq  exon    31567   32940   .   +   .   ID=id15;Parent=NM_001178204.1
NC_001133.9 RefSeq  CDS 31567   32940   .   +   0   ID=NP_009339.1;Parent=NM_001178204.1
NC_001133.9 RefSeq  gene    33448   34701   .   +   .   ID=BDH2;Parent=-
NC_001133.9 RefSeq  mRNA    33448   34701   .   +   .   ID=NM_001178203.1;Parent=BDH2
NC_001133.9 RefSeq  exon    33448   34701   .   +   .   ID=id16;Parent=NM_001178203.1
NC_001133.9 RefSeq  CDS 33448   34701   .   +   0   ID=NP_009340.1;Parent=NM_001178203.1

Saccharomyces_kudriavzevii.gff (Four GFF from braker have this format. Not every feature has ID/Name and Parent)

JH797291.1  AUGUSTUS    gene    1000    1968    0.49    -   .   g3896
JH797291.1  AUGUSTUS    exon    1000    1968    0.49    -   .   Parent=g3896.t1
JH797291.1  AUGUSTUS    CDS 1000    1968    .   -   0   Parent=g3896.t1
JH797291.1  AUGUSTUS    mRNA    1000    1968    .   -   .   ID=g3896.t1;Other=g3896
JH797504.1  AUGUSTUS    CDS 10008   10607   .   -   0   Parent=g4527.t1
JH797504.1  AUGUSTUS    exon    10008   10607   1.00    -   .   Parent=g4527.t1
JH797504.1  AUGUSTUS    gene    10008   10607   1   -   .   g4527
JH797504.1  AUGUSTUS    mRNA    10008   10607   .   -   .   ID=g4527.t1;Other=g4527
JH796508.1  AUGUSTUS    gene    10008   11024   0.74    -   .   g1839
JH796508.1  AUGUSTUS    exon    10008   11024   0.74    -   .   Parent=g1839.t1
JH796508.1  AUGUSTUS    CDS 10008   11024   .   -   0   Parent=g1839.t1
JH796508.1  AUGUSTUS    mRNA    10008   11024   .   -   .   ID=g1839.t1;Other=g1839
JH797932.1  AUGUSTUS    exon    100 1214    0.48    +   .   Parent=g5631.t1
JH797932.1  AUGUSTUS    CDS 100 1214    .   +   2   Parent=g5631.t1
JH797932.1  AUGUSTUS    mRNA    100 1214    .   +   .   ID=g5631.t1;Other=g5631
JH797398.1  AUGUSTUS    gene    10014   10952   0.46    +   .   g4187
JH797398.1  AUGUSTUS    exon    10014   10952   0.46    +   .   Parent=g4187.t1
JH797398.1  AUGUSTUS    CDS 10014   10952   .   +   0   Parent=g4187.t1
JH797398.1  AUGUSTUS    mRNA    10014   10952   .   +   .   ID=g4187.t1;Other=g4187
JH796421.1  AUGUSTUS    CDS 10016   11785   .   +   0   Parent=g1294.t1
JH796421.1  AUGUSTUS    exon    10016   11785   1.00    +   .   Parent=g1294.t1
JH796421.1  AUGUSTUS    gene    10016   11785   1   +   .   g1294
JH796421.1  AUGUSTUS    mRNA    10016   11785   .   +   .   ID=g1294.t1;Other=g1294
JH797524.1  AUGUSTUS    CDS 10030   11802   .   +   0   Parent=g4589.t1
JH797524.1  AUGUSTUS    exon    10030   11802   1.00    +   .   Parent=g4589.t1
JH797524.1  AUGUSTUS    gene    10030   11802   1   +   .   g4589
JH797524.1  AUGUSTUS    mRNA    10030   11802   .   +   .   ID=g4589.t1;Other=g4589
JH796931.1  AUGUSTUS    gene    1003    1716    0.99    -   .   g2897
JH796931.1  AUGUSTUS    exon    1003    1716    0.99    -   .   Parent=g2897.t1
JH796931.1  AUGUSTUS    CDS 1003    1716    .   -   0   Parent=g2897.t1
JH796931.1  AUGUSTUS    mRNA    1003    1716    .   -   .   ID=g2897.t1;Other=g2897
JH797655.1  AUGUSTUS    exon    1003    1763    0.89    -   .   Parent=g4975.t1
JH797655.1  AUGUSTUS    CDS 1003    1763    .   -   0   Parent=g4975.t1
JH797037.1  AUGUSTUS    CDS 10036   10911   .   -   0   Parent=g3333.t1
JH797037.1  AUGUSTUS    exon    10036   10911   1.00    -   .   Parent=g3333.t1
JH797037.1  AUGUSTUS    gene    10036   10911   1   -   .   g3333
JH797037.1  AUGUSTUS    mRNA    10036   10911   .   -   .   ID=g3333.t1;Other=g3333
JH796401.1  AUGUSTUS    CDS 1004    2005    .   +   0   Parent=g1189.t1
JH796401.1  AUGUSTUS    exon    1004    2005    1.00    +   .   Parent=g1189.t1
JH796401.1  AUGUSTUS    gene    1004    2005    1   +   .   g1189
JH796401.1  AUGUSTUS    mRNA    1004    2005    .   +   .   ID=g1189.t1;Other=g1189
JH796516.1  AUGUSTUS    CDS 10042   10905   .   -   0   Parent=g1889.t1
JH796516.1  AUGUSTUS    exon    10042   10905   1.00    -   .   Parent=g1889.t1
JH796516.1  AUGUSTUS    gene    10042   10905   1   -   .   g1889
JH796516.1  AUGUSTUS    mRNA    10042   10905   .   -   .   ID=g1889.t1;Other=g1889
JH798027.1  AUGUSTUS    CDS 10047   11501   .   -   0   Parent=g6136.t1
JH798027.1  AUGUSTUS    exon    10047   11501   1.00    -   .   Parent=g6136.t1
JH798027.1  AUGUSTUS    gene    10047   11501   1   -   .   g6136
JH798027.1  AUGUSTUS    mRNA    10047   11501   .   -   .   ID=g6136.t1;Other=g6136
urmi-21 commented 7 years ago

@arendsee I suspect this problem is related to Knitr. See this issue for example http://rmarkdown.rstudio.com/authoring_migrating_from_v1.html or https://github.com/rstudio/rmarkdown/issues/149 I don't know much about knitr so can't say for sure but i suspect the object not found type of error has to do something with how Knitr is handling data.

arendsee commented 7 years ago

The message

3: In testSI(si, tinfo, qinfo) :
  Found 51 search intervals with stop < start. Possible bug
    in Synder. You should NOT proceed.
4: In testSI(si, tinfo, qinfo) :
  66 target intervals end after expected terminus of the
    target chromosome. If this is very bad.
5: In testSI(si, tinfo, qinfo) :
  66 target intervals end after expected terminus of the
    target chromosome. If this is very bad.

Indicates that there is something fishy with the inputs. This is a bit of a hack, but try subtracting 2 from the stop position of each row in your synteny map. You can do this with AWK. This is not a solution to the problem, but if it works, it will tell us what the problem is.

@urmi-21 There is a knitr issue we ran into earlier (which required we remove caching from all the code snippets). But I don't think this is caused by knitr. At least, the warnings that are being produced have nothing to do with knitr.

arendsee commented 7 years ago

There is another problem which was Fagin caught:

8: In LoadGFF(gfffile, seqinfo = tinfo) :
  18177 GFF entries are missing ids (ID=<id> labels in attribute column)

The problem is that not all the rows in the GFF have ID tags. Fagin does some parsing of the tags in the attribute column. This is a pretty tricky business, since the contents of the 9th column of a GFF file are not standardized. Fagin handles the RefSeq and Ensembl format fine. But the AUGUSTUS-produced GFF you are using (Saccharomyces_kudriavzevii.gff) has a different format. The only thing that is missing is a unique id for each element. You can add your own with a bit of AWK magic. Just do something like this:

sed 's/ID=/OLD_ID=/'  myfile.gff | awk 'BEGIN{OFS="\t"} /^[^#]/ {$9 =  "ID=" a++ ";" $9} {print}

I haven't tested the script above, so it probably doesn't quite work, but what it is intended to do is remove any existing ID tags (renaming them so we keep anything useful they may contain) and then adds a new unique id tag.

arendsee commented 7 years ago

@lijing28101 Try the two suggestions above and tell me how it goes.

lijing28101 commented 7 years ago

@arendsee I didn't find any intervals in the synteny map with the stop < start. The script for adding ID works.

arendsee commented 7 years ago

@lijing28101 There definitely shouldn't be any intervals in the synteny map where start > stop. But there could be cases where the stop in the search interval is greater than the length of the scaffold in the GFF. This could happen if the offsets to synder are wrong (that 0011 string) or if the assembly used to build the synteny maps isn't the same as the one used in the downstream fagin analysis. Subtracting 2 from the synteny maps would "fix" the first problem.

So does fagin run correctly after adding the IDs?

lijing28101 commented 7 years ago

@arendsee It doesn't work. Still has same error

  object 'prot2allorf.cutoff' not found
Calls: knit ... in_dir -> inline_exec -> withVisible -> eval -> eval 

Only the missing IDs problem has been solved.

lijing28101 commented 7 years ago

@arendsee After I subtracting 2 from syneny map, running synder again, the other warning messages are solved, only the last one and the error:

Error in eval(expr, envir, enclos) : 
  object 'prot2allorf.cutoff' not found
Calls: knit ... in_dir -> inline_exec -> withVisible -> eval -> eval
In addition: Warning message:
In LoadSearchIntervals(sifile, qinfo = qinfo, tinfo = tinfo) :
  Found 2 cases where query length X search interval length
    is greater than 100000000. This will cause problems in the current implementation
    of query-vs-search interval DNA search, so these intervals will not be
    searched for dna2dna similarity.

What's the function of prot2allorf.cutoff?

arendsee commented 7 years ago

prot2allorf.cutoff is just a constant. If it wasn't set, it means something bad happened upstream. The warning message shouldn't lead to the program crashing, so something else is wrong.

lijing28101 commented 7 years ago

@arendsee the log file has nothing. How can I detect where is wrong?

arendsee commented 7 years ago

Can you post all the input data? I'll need full genomes for each species, the GFF files, and the phylogenetic tree. Put it all in a TAR archive and compress it. I'll run it on my end and figure out what is going wrong. Then I'll update Fagin so that it either deals with the problem directly or at least reports a specific error message early.

lijing28101 commented 7 years ago

I have shared the file with you through google drive. Do you receive the email? I sent it to your hotmail.

arendsee commented 7 years ago

How big is the tar file? And can you compress it?

tar -cjf input.tar.bz2 inputs/
arendsee commented 7 years ago

I can't access things through links to Google drive, could you directly email the compressed data to my hotmail account?

lijing28101 commented 7 years ago

It's over 25MB. I cannot send you through email. I send it to you through slack.

arendsee commented 7 years ago

OK, that works, I'll look it over and get back with you in a bit

arendsee commented 7 years ago

@lijing28101 I'm working on going cleaning up the GFF input script. I've added several new checks and made it a little more flexible. I haven't yet run your data all the way through the pipeline. I could probably force your data through by fixing the problems specific to your data. But I think it is better for me to take the time to refactor everything so these problems don't come back. And doing this well will take awhile, probably a week or more (much more, depending on how deeply I decide to refactor).

I'll close this issue once your data pass through successfully AND I have the checks, transformations, and documentation in place to prevent similar difficulties in the future.

arendsee commented 7 years ago

@lijing28101 OK, my last commit should fix everything on the Fagin end.

There are a couple issues you need to address with your data.

  1. Your file orphan-list.txt isn't correct. The ids in the file need to match the ids of gene models in the focal GFF file, e.g. NM_001182500.1, but the file you provide has ids of the form rna1, which doesn't match anything in the GFF. You may need to rerun the orphan blast predictions using the correct ids, or find a map between the two.

  2. The S. cerevisiae GFF file doesn't follow a consistent convention for describing gene models. For most of the genes, the CDS and exon have a Parent tag in the 9th column that links to an mRNA tag. Fagin understands this. However, for some of the models (including all those in the mitochondria) the CDS entries have Parent tags that link to a gene's Name field instead. Fagin handles this problem by removing these entries and printing a warning message. You can ignore the messages for now.

  3. Remapping IDs, as I suggested you do a few posts back, is no longer necessary (and indeed wouldn't have worked in the first place), so you should go back to using the original GFF files.