Closed arendsee closed 7 years ago
Could you post the first 50 lines of so of the focal and first target GFF files?
Focal: Saccharomyces_.cerevisiaegff (Other two targets from NCBI have same format)
NC_001133.9 RefSeq telomere 1 801 . - . ID=id1;Parent=-
NC_001133.9 RefSeq origin_of_replication 707 776 . + . ID=id2;Parent=-
NC_001133.9 RefSeq gene 1807 2169 . - . ID=PAU8;Parent=-
NC_001133.9 RefSeq mRNA 1807 2169 . - . ID=NM_001180043.1;Parent=PAU8
NC_001133.9 RefSeq exon 1807 2169 . - . ID=id3;Parent=NM_001180043.1
NC_001133.9 RefSeq CDS 1807 2169 . - 0 ID=NP_009332.1;Parent=NM_001180043.1
NC_001133.9 RefSeq gene 2480 2707 . + . ID=YAL067W-A;Parent=-
NC_001133.9 RefSeq mRNA 2480 2707 . + . ID=NM_001184582.1;Parent=YAL067W-A
NC_001133.9 RefSeq exon 2480 2707 . + . ID=id4;Parent=NM_001184582.1
NC_001133.9 RefSeq CDS 2480 2707 . + 0 ID=NP_878038.1;Parent=NM_001184582.1
NC_001133.9 RefSeq gene 7235 9016 . - . ID=SEO1;Parent=-
NC_001133.9 RefSeq mRNA 7235 9016 . - . ID=NM_001178208.1;Parent=SEO1
NC_001133.9 RefSeq exon 7235 9016 . - . ID=id5;Parent=NM_001178208.1
NC_001133.9 RefSeq CDS 7235 9016 . - 0 ID=NP_009333.1;Parent=NM_001178208.1
NC_001133.9 RefSeq origin_of_replication 7997 8547 . + . ID=id6;Parent=-
NC_001133.9 RefSeq gene 11565 11951 . - . ID=YAL065C;Parent=-
NC_001133.9 RefSeq mRNA 11565 11951 . - . ID=NM_001179897.1;Parent=YAL065C
NC_001133.9 RefSeq exon 11565 11951 . - . ID=id7;Parent=NM_001179897.1
NC_001133.9 RefSeq CDS 11565 11951 . - 0 ID=NP_009335.1;Parent=NM_001179897.1
NC_001133.9 RefSeq gene 12046 12426 . + . ID=YAL064W-B;Parent=-
NC_001133.9 RefSeq mRNA 12046 12426 . + . ID=NM_001180042.1;Parent=YAL064W-B
NC_001133.9 RefSeq exon 12046 12426 . + . ID=id8;Parent=NM_001180042.1
NC_001133.9 RefSeq CDS 12046 12426 . + 0 ID=NP_009336.1;Parent=NM_001180042.1
NC_001133.9 RefSeq gene 13363 13743 . - . ID=TDA8;Parent=-
NC_001133.9 RefSeq mRNA 13363 13743 . - . ID=NM_001180041.1;Parent=TDA8
NC_001133.9 RefSeq exon 13363 13743 . - . ID=id9;Parent=NM_001180041.1
NC_001133.9 RefSeq CDS 13363 13743 . - 0 ID=NP_058136.1;Parent=NM_001180041.1
NC_001133.9 RefSeq gene 21566 21850 . + . ID=YAL064W;Parent=-
NC_001133.9 RefSeq mRNA 21566 21850 . + . ID=NM_001178206.2;Parent=YAL064W
NC_001133.9 RefSeq exon 21566 21850 . + . ID=id10;Parent=NM_001178206.2
NC_001133.9 RefSeq CDS 21566 21850 . + 0 ID=NP_009337.2;Parent=NM_001178206.2
NC_001133.9 RefSeq repeat_region 22230 22552 . + . ID=id11;Parent=-
NC_001133.9 RefSeq gene 22395 22685 . - . ID=YAL063C-A;Parent=-
NC_001133.9 RefSeq mRNA 22395 22685 . - . ID=NM_001184642.1;Parent=YAL063C-A
NC_001133.9 RefSeq exon 22395 22685 . - . ID=id12;Parent=NM_001184642.1
NC_001133.9 RefSeq CDS 22395 22685 . - 0 ID=NP_878039.1;Parent=NM_001184642.1
NC_001133.9 RefSeq gene 24000 27968 . - . ID=FLO9;Parent=-
NC_001133.9 RefSeq mRNA 24000 27968 . - . ID=NM_001178205.1;Parent=FLO9
NC_001133.9 RefSeq exon 24000 27968 . - . ID=id13;Parent=NM_001178205.1
NC_001133.9 RefSeq CDS 24000 27968 . - 0 ID=NP_009338.1;Parent=NM_001178205.1
NC_001133.9 RefSeq origin_of_replication 30946 31183 . + . ID=id14;Parent=-
NC_001133.9 RefSeq gene 31567 32940 . + . ID=GDH3;Parent=-
NC_001133.9 RefSeq mRNA 31567 32940 . + . ID=NM_001178204.1;Parent=GDH3
NC_001133.9 RefSeq exon 31567 32940 . + . ID=id15;Parent=NM_001178204.1
NC_001133.9 RefSeq CDS 31567 32940 . + 0 ID=NP_009339.1;Parent=NM_001178204.1
NC_001133.9 RefSeq gene 33448 34701 . + . ID=BDH2;Parent=-
NC_001133.9 RefSeq mRNA 33448 34701 . + . ID=NM_001178203.1;Parent=BDH2
NC_001133.9 RefSeq exon 33448 34701 . + . ID=id16;Parent=NM_001178203.1
NC_001133.9 RefSeq CDS 33448 34701 . + 0 ID=NP_009340.1;Parent=NM_001178203.1
Saccharomyces_kudriavzevii.gff (Four GFF from braker have this format. Not every feature has ID/Name and Parent)
JH797291.1 AUGUSTUS gene 1000 1968 0.49 - . g3896
JH797291.1 AUGUSTUS exon 1000 1968 0.49 - . Parent=g3896.t1
JH797291.1 AUGUSTUS CDS 1000 1968 . - 0 Parent=g3896.t1
JH797291.1 AUGUSTUS mRNA 1000 1968 . - . ID=g3896.t1;Other=g3896
JH797504.1 AUGUSTUS CDS 10008 10607 . - 0 Parent=g4527.t1
JH797504.1 AUGUSTUS exon 10008 10607 1.00 - . Parent=g4527.t1
JH797504.1 AUGUSTUS gene 10008 10607 1 - . g4527
JH797504.1 AUGUSTUS mRNA 10008 10607 . - . ID=g4527.t1;Other=g4527
JH796508.1 AUGUSTUS gene 10008 11024 0.74 - . g1839
JH796508.1 AUGUSTUS exon 10008 11024 0.74 - . Parent=g1839.t1
JH796508.1 AUGUSTUS CDS 10008 11024 . - 0 Parent=g1839.t1
JH796508.1 AUGUSTUS mRNA 10008 11024 . - . ID=g1839.t1;Other=g1839
JH797932.1 AUGUSTUS exon 100 1214 0.48 + . Parent=g5631.t1
JH797932.1 AUGUSTUS CDS 100 1214 . + 2 Parent=g5631.t1
JH797932.1 AUGUSTUS mRNA 100 1214 . + . ID=g5631.t1;Other=g5631
JH797398.1 AUGUSTUS gene 10014 10952 0.46 + . g4187
JH797398.1 AUGUSTUS exon 10014 10952 0.46 + . Parent=g4187.t1
JH797398.1 AUGUSTUS CDS 10014 10952 . + 0 Parent=g4187.t1
JH797398.1 AUGUSTUS mRNA 10014 10952 . + . ID=g4187.t1;Other=g4187
JH796421.1 AUGUSTUS CDS 10016 11785 . + 0 Parent=g1294.t1
JH796421.1 AUGUSTUS exon 10016 11785 1.00 + . Parent=g1294.t1
JH796421.1 AUGUSTUS gene 10016 11785 1 + . g1294
JH796421.1 AUGUSTUS mRNA 10016 11785 . + . ID=g1294.t1;Other=g1294
JH797524.1 AUGUSTUS CDS 10030 11802 . + 0 Parent=g4589.t1
JH797524.1 AUGUSTUS exon 10030 11802 1.00 + . Parent=g4589.t1
JH797524.1 AUGUSTUS gene 10030 11802 1 + . g4589
JH797524.1 AUGUSTUS mRNA 10030 11802 . + . ID=g4589.t1;Other=g4589
JH796931.1 AUGUSTUS gene 1003 1716 0.99 - . g2897
JH796931.1 AUGUSTUS exon 1003 1716 0.99 - . Parent=g2897.t1
JH796931.1 AUGUSTUS CDS 1003 1716 . - 0 Parent=g2897.t1
JH796931.1 AUGUSTUS mRNA 1003 1716 . - . ID=g2897.t1;Other=g2897
JH797655.1 AUGUSTUS exon 1003 1763 0.89 - . Parent=g4975.t1
JH797655.1 AUGUSTUS CDS 1003 1763 . - 0 Parent=g4975.t1
JH797037.1 AUGUSTUS CDS 10036 10911 . - 0 Parent=g3333.t1
JH797037.1 AUGUSTUS exon 10036 10911 1.00 - . Parent=g3333.t1
JH797037.1 AUGUSTUS gene 10036 10911 1 - . g3333
JH797037.1 AUGUSTUS mRNA 10036 10911 . - . ID=g3333.t1;Other=g3333
JH796401.1 AUGUSTUS CDS 1004 2005 . + 0 Parent=g1189.t1
JH796401.1 AUGUSTUS exon 1004 2005 1.00 + . Parent=g1189.t1
JH796401.1 AUGUSTUS gene 1004 2005 1 + . g1189
JH796401.1 AUGUSTUS mRNA 1004 2005 . + . ID=g1189.t1;Other=g1189
JH796516.1 AUGUSTUS CDS 10042 10905 . - 0 Parent=g1889.t1
JH796516.1 AUGUSTUS exon 10042 10905 1.00 - . Parent=g1889.t1
JH796516.1 AUGUSTUS gene 10042 10905 1 - . g1889
JH796516.1 AUGUSTUS mRNA 10042 10905 . - . ID=g1889.t1;Other=g1889
JH798027.1 AUGUSTUS CDS 10047 11501 . - 0 Parent=g6136.t1
JH798027.1 AUGUSTUS exon 10047 11501 1.00 - . Parent=g6136.t1
JH798027.1 AUGUSTUS gene 10047 11501 1 - . g6136
JH798027.1 AUGUSTUS mRNA 10047 11501 . - . ID=g6136.t1;Other=g6136
@arendsee I suspect this problem is related to Knitr. See this issue for example http://rmarkdown.rstudio.com/authoring_migrating_from_v1.html or https://github.com/rstudio/rmarkdown/issues/149 I don't know much about knitr so can't say for sure but i suspect the object not found type of error has to do something with how Knitr is handling data.
The message
3: In testSI(si, tinfo, qinfo) :
Found 51 search intervals with stop < start. Possible bug
in Synder. You should NOT proceed.
4: In testSI(si, tinfo, qinfo) :
66 target intervals end after expected terminus of the
target chromosome. If this is very bad.
5: In testSI(si, tinfo, qinfo) :
66 target intervals end after expected terminus of the
target chromosome. If this is very bad.
Indicates that there is something fishy with the inputs. This is a bit of a hack, but try subtracting 2 from the stop position of each row in your synteny map. You can do this with AWK. This is not a solution to the problem, but if it works, it will tell us what the problem is.
@urmi-21 There is a knitr issue we ran into earlier (which required we remove caching from all the code snippets). But I don't think this is caused by knitr. At least, the warnings that are being produced have nothing to do with knitr.
There is another problem which was Fagin caught:
8: In LoadGFF(gfffile, seqinfo = tinfo) :
18177 GFF entries are missing ids (ID=<id> labels in attribute column)
The problem is that not all the rows in the GFF have ID tags. Fagin does some parsing of the tags in the attribute column. This is a pretty tricky business, since the contents of the 9th column of a GFF file are not standardized. Fagin handles the RefSeq and Ensembl format fine. But the AUGUSTUS-produced GFF you are using (Saccharomyces_kudriavzevii.gff
) has a different format. The only thing that is missing is a unique id for each element. You can add your own with a bit of AWK magic. Just do something like this:
sed 's/ID=/OLD_ID=/' myfile.gff | awk 'BEGIN{OFS="\t"} /^[^#]/ {$9 = "ID=" a++ ";" $9} {print}
I haven't tested the script above, so it probably doesn't quite work, but what it is intended to do is remove any existing ID tags (renaming them so we keep anything useful they may contain) and then adds a new unique id tag.
@lijing28101 Try the two suggestions above and tell me how it goes.
@arendsee I didn't find any intervals in the synteny map with the stop < start. The script for adding ID works.
@lijing28101 There definitely shouldn't be any intervals in the synteny map where start > stop
. But there could be cases where the stop
in the search interval is greater than the length of the scaffold in the GFF. This could happen if the offsets to synder are wrong (that 0011
string) or if the assembly used to build the synteny maps isn't the same as the one used in the downstream fagin analysis. Subtracting 2 from the synteny maps would "fix" the first problem.
So does fagin run correctly after adding the IDs?
@arendsee It doesn't work. Still has same error
object 'prot2allorf.cutoff' not found
Calls: knit ... in_dir -> inline_exec -> withVisible -> eval -> eval
Only the missing IDs problem has been solved.
@arendsee After I subtracting 2 from syneny map, running synder again, the other warning messages are solved, only the last one and the error:
Error in eval(expr, envir, enclos) :
object 'prot2allorf.cutoff' not found
Calls: knit ... in_dir -> inline_exec -> withVisible -> eval -> eval
In addition: Warning message:
In LoadSearchIntervals(sifile, qinfo = qinfo, tinfo = tinfo) :
Found 2 cases where query length X search interval length
is greater than 100000000. This will cause problems in the current implementation
of query-vs-search interval DNA search, so these intervals will not be
searched for dna2dna similarity.
What's the function of prot2allorf.cutoff
?
prot2allorf.cutoff
is just a constant. If it wasn't set, it means something bad happened upstream. The warning message shouldn't lead to the program crashing, so something else is wrong.
@arendsee the log file has nothing. How can I detect where is wrong?
Can you post all the input data? I'll need full genomes for each species, the GFF files, and the phylogenetic tree. Put it all in a TAR archive and compress it. I'll run it on my end and figure out what is going wrong. Then I'll update Fagin so that it either deals with the problem directly or at least reports a specific error message early.
I have shared the file with you through google drive. Do you receive the email? I sent it to your hotmail.
How big is the tar file? And can you compress it?
tar -cjf input.tar.bz2 inputs/
I can't access things through links to Google drive, could you directly email the compressed data to my hotmail account?
It's over 25MB. I cannot send you through email. I send it to you through slack.
OK, that works, I'll look it over and get back with you in a bit
@lijing28101 I'm working on going cleaning up the GFF input script. I've added several new checks and made it a little more flexible. I haven't yet run your data all the way through the pipeline. I could probably force your data through by fixing the problems specific to your data. But I think it is better for me to take the time to refactor everything so these problems don't come back. And doing this well will take awhile, probably a week or more (much more, depending on how deeply I decide to refactor).
I'll close this issue once your data pass through successfully AND I have the checks, transformations, and documentation in place to prevent similar difficulties in the future.
@lijing28101 OK, my last commit should fix everything on the Fagin end.
There are a couple issues you need to address with your data.
Your file orphan-list.txt
isn't correct. The ids in the file need to match the ids of gene models in the focal GFF file, e.g. NM_001182500.1
, but the file you provide has ids of the form rna1
, which doesn't match anything in the GFF. You may need to rerun the orphan blast predictions using the correct ids, or find a map between the two.
The S. cerevisiae GFF file doesn't follow a consistent convention for describing gene models. For most of the genes, the CDS and exon have a Parent
tag in the 9th column that links to an mRNA
tag. Fagin understands this. However, for some of the models (including all those in the mitochondria) the CDS
entries have Parent
tags that link to a gene
's Name
field instead. Fagin handles this problem by removing these entries and printing a warning message. You can ignore the messages for now.
Remapping IDs, as I suggested you do a few posts back, is no longer necessary (and indeed wouldn't have worked in the first place), so you should go back to using the original GFF files.
Running Fagin generates the error message: