genometools / genomethreader

GenomeThreader gene prediction software.
ISC License
11 stars 5 forks source link

How to merge mutiple GFF3 files? #20

Open bioinformaticspcj opened 3 years ago

bioinformaticspcj commented 3 years ago

Dear the authors: Thanks for devoloping such an useful program. Now I have a question that if I have mutiple GFF3 files to train bssm, how can I merge these files for the gthbssmtrain program? I do not know how to sort the file as the program put an error:

"error: the file testPB.gff is not sorted (example: line 9 and 14)"

Thanks a gain and looking forward to your valuable help.

satta commented 3 years ago

Without being too familiar with the BSSM input requirements, I can say that you can use GenomeTools to preprocess your input. In general, you can use gt gff3 sort to sort individual GFF3 files. Use the -tidy option when it still complains about some format quirks. Then, you can use gt merge to merge all sorted files into one. The -help option can tell you more about how to use each individual tool.

bioinformaticspcj commented 3 years ago

Thanks a lot for your valuable advice. I have tried the genometools program but met another question as when I ran the conmand: 

/data/nfs/OriginTools/gene_anotation/structure/gth-1.7.1-Linux_x86_64-64bit/bin/gthbssmtrain -seqfile testPB.fa -force -outdir training_data3 testPB.gff

 it returned: "Assertion failed: (idx < gt_encseq_num_of_sequences(bs->encseq) && end >= start), function gt_bioseq_get_sequence_range, file src/core/bioseq.c, line 388. This is a bug, please report it at https://github.com/genometools/genometools/issues Please make sure you are running the latest release which can be found at http://genometools.org/pub/ You can check your version number with gt -version. Aborted (core dumped)"

I do not know why. The version of gt is 1.6.1.

Could you give me some advice? Thanks again. ------------------ 原始邮件 ------------------ 发件人: "genometools/genomethreader" <notifications@github.com>; 发送时间: 2020年11月28日(星期六) 凌晨1:13 收件人: "genometools/genomethreader"<genomethreader@noreply.github.com>; 抄送: "Monkeyjun"<2532090267@qq.com>;"Author"<author@noreply.github.com>; 主题: Re: [genometools/genomethreader] How to merge mutiple GFF3 files? (#20)

Without being too familiar with the BSSM input requirements, I can say that you can use GenomeTools to preprocess your input. In general, you can use gt gff3 sort to sort individual GFF3 files. Use the -tidy option when it still complains about some format quirks. Then, you can use gt merge to merge all sorted files into one. The -help option can tell you more about how to use each individual tool.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or unsubscribe.

satta commented 3 years ago

Looks like the problem seems to be related by the relation of sequences and annotations in your input. This would probably be difficult to debug without your input files (or a minimal reduced version triggering the problem).

bioinformaticspcj commented 3 years ago

Hi, Very thanks. The attached files are my input. 

Could you implement them to fix the bug.

Thanks a lot.

------------------ 原始邮件 ------------------ 发件人: "genometools/genomethreader" <notifications@github.com>; 发送时间: 2020年11月30日(星期一) 凌晨3:10 收件人: "genometools/genomethreader"<genomethreader@noreply.github.com>; 抄送: "Monkeyjun"<2532090267@qq.com>;"Author"<author@noreply.github.com>; 主题: Re: [genometools/genomethreader] How to merge mutiple GFF3 files? (#20)

Looks like the problem seems to be related by the relation of sequences and annotations in your input. This would probably be difficult to debug without your input files (or a minimal reduced version triggering the problem).

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or unsubscribe.

从QQ邮箱发来的超大附件

testPB.fa.gz (306.74M, 2020年12月30日 10:12 到期)进入下载页面:http://mail.qq.com/cgi-bin/ftnExs_download?t=exs_ftn_download&k=73623637991be1dca9c3daa11232071f57010154015105521f005206001f01010b561b0f01010c1d565b5504060a5452075a0f03343e354457114267761c53511c054c3709&code=2b674250

testPB.gff.gz (13.39M, 无限期)进入下载页面:http://mail.qq.com/cgi-bin/ftnExs_download?k=20366166f6f919c9f9978df013380219000505510c0b00064f0504565015045301024c04005d551b5b555007510c0155000053513535304207451536771657500418061c3505&t=exs_ftn_download&code=b6af5806

satta commented 3 years ago

I have taken some free time to dig into this today. I was able to reduce the test case to this minimal example: test.gff3.gz

Apparently, when calculating sequence ranges for extraction, we are passed a start coordinate of 0, which then, when subtracted the offset of 1, underflows later during coordinate munging before extraction in GenomeTools:

[...]
Program received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50  ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007ffff7ca455b in __GI_abort () at abort.c:79
#2  0x00005555555e09aa in gt_bioseq_get_sequence_range (bs=<optimized out>, idx=<optimized out>, start=start@entry=18446744073709551615, end=end@entry=100)
    at src/core/bioseq.c:388
#3  0x00005555555e226e in gt_bioseq_col_md5_to_seq (sc=<optimized out>, seq=0x7fffffffdaa0, start=18446744073709551615, end=100, md5_seqid=<optimized out>, 
    err=0x5555556827b0) at src/core/bioseq_col.c:287
#4  0x0000555555599673 in gt_region_mapping_get_sequence (rm=0x555555687980, seq=0x7fffffffdaa0, seqid=0x555555688ee0, start=0, end=101, err=0x5555556827b0)
    at src/extended/region_mapping.c:240
#5  0x000055555556123e in find_false_sites (false_don_0_gt=0x55555569c8e0, false_don_0_gc=0x555555690b40, false_acc_0=0x55555569c910, false_don_1_gt=0x5555556940b0, 
    false_don_1_gc=0x555555690b70, false_acc_1=0x5555556940e0, false_don_2_gt=0x555555694110, false_don_2_gc=0x555555690ba0, false_acc_2=0x555555690ab0, 
    seqs=0x555555687bb0, proc_exons=true, region_mapping=0x555555687980, gcdonor=true, err=0x5555556827b0) at src/gth/bssm_seq_processor.c:1089
#6  0x000055555556198f in gth_bssm_seq_processor_find_false_sites (bsp=0x5555556879f0, region_mapping=0x555555687980, err=0x5555556827b0)
    at src/gth/bssm_seq_processor.c:1153
#7  0x000055555555ce4a in gt_gthbssmtrain_runner (argc=7, argv=0x7fffffffdde8, parsed_args=6, tool_arguments=0x555555684870, err=0x5555556827b0)
    at src/gth/gt_gthbssmtrain.c:319
#8  0x000055555558727b in gt_tool_run (tool=tool@entry=0x555555684810, argc=argc@entry=7, argv=argv@entry=0x7fffffffdde8, err=err@entry=0x5555556827b0)
    at src/core/tool.c:107
#9  0x0000555555587e0c in gt_toolobjdriver_with_license (tool_constructor=<optimized out>, version_func=0x55555555c22b <gthversionfunc>, argc=7, argv=0x7fffffffdde8, 
    license_out=license_out@entry=0x0, license_constructor=license_constructor@entry=0x0, license_destructor=0x0) at src/core/tooldriver.c:96
#10 0x0000555555587f4f in gt_toolobjdriver (tool_constructor=<optimized out>, version_func=<optimized out>, argc=<optimized out>, argv=<optimized out>)
    at src/core/tooldriver.c:67
#11 0x000055555555c229 in main (argc=7, argv=0x7fffffffdde8) at src/gthbssmtrain.c:8
(gdb) f 5
#5  0x000055555556123e in find_false_sites (false_don_0_gt=0x55555569c8e0, false_don_0_gc=0x555555690b40, false_acc_0=0x55555569c910, false_don_1_gt=0x5555556940b0, 
    false_don_1_gc=0x555555690b70, false_acc_1=0x5555556940e0, false_don_2_gt=0x555555694110, false_don_2_gc=0x555555690ba0, false_acc_2=0x555555690ab0, 
    seqs=0x555555687bb0, proc_exons=true, region_mapping=0x555555687980, gcdonor=true, err=0x5555556827b0) at src/gth/bssm_seq_processor.c:1089
1089                  had_err = gt_region_mapping_get_sequence(region_mapping,
(gdb) p *intron
$1 = {seqid = 0x555555688ee0, seq = 0x555555687f70, desc = 0x555555690a80, range = {start = 11, end = 77}, reverse = false, phase = 0}
(gdb) p acc_range
$2 = {start = 0, end = 101}
(gdb) f 4
#4  0x0000555555599673 in gt_region_mapping_get_sequence (rm=0x555555687980, seq=0x7fffffffdaa0, seqid=0x555555688ee0, start=0, end=101, err=0x5555556827b0)
    at src/extended/region_mapping.c:240
240       had_err = gt_seq_col_md5_to_seq(rm->seq_col, seq, start - offset,
(gdb) p offset
$3 = 1
(gdb) p start
$4 = 0
(gdb) p start-offset
$5 = 18446744073709551615
(gdb) f 2
#2  0x00005555555e09aa in gt_bioseq_get_sequence_range (bs=<optimized out>, idx=<optimized out>, start=start@entry=18446744073709551615, end=end@entry=100)
    at src/core/bioseq.c:388
388   gt_assert(idx < gt_encseq_num_of_sequences(bs->encseq) && end >= start);
(gdb) p start
$6 = 18446744073709551615

It looks like these function expect GFF3-style coordinates, starting at 1. However, the 0 was calculated in src/gth/bssm_seq_processor.c:1067:

if (!intron->reverse) {
  if (intron->range.start + j >= 50) {
    acc_range.start = intron->range.start + j - 50;
    acc_underflow = false;
  }
  else
    acc_underflow = true;
  acc_range.end = intron->range.start + j + 51;
}

which -- here in the case of j = 39 (and intron->range.start = 11, see above) -- will lead to a acc_range.start of 0.

Simply replacing the >= with > fixes the problem and allows the training to complete on OP's whole data set.

diff --git a/src/gth/bssm_seq_processor.c b/src/gth/bssm_seq_processor.c
index 62bb279..55e6840 100644
--- a/src/gth/bssm_seq_processor.c
+++ b/src/gth/bssm_seq_processor.c
@@ -1021,7 +1021,7 @@ static int find_false_sites(GtArray *false_don_0_gt, GtArray *false_don_0_gc,
              (gcdonor &&
              (cseq[j+1] == 'C' || cseq[j+1] == 'c')))) {
           if (!intron->reverse) {
-            if (intron->range.start + j >= 50) {
+            if (intron->range.start + j > 50) {
               don_range.start = intron->range.start + j - 50;
               don_underflow = false;
             }
@@ -1030,7 +1030,7 @@ static int find_false_sites(GtArray *false_don_0_gt, GtArray *false_don_0_gc,
             don_range.end = intron->range.start + j + 51;
           }
           else {
-            if (intron->range.end >= j + 51) {
+            if (intron->range.end > j + 51) {
               don_range.start = intron->range.end - j - 51;
               don_underflow = false;
             }
@@ -1063,7 +1063,7 @@ static int find_false_sites(GtArray *false_don_0_gt, GtArray *false_don_0_gc,
                   (cseq[j]   == 'A' || cseq[j]   == 'a') &&
                   (cseq[j+1] == 'G' || cseq[j+1] == 'g')) {
           if (!intron->reverse) {
-            if (intron->range.start + j >= 50) {
+            if (intron->range.start + j > 50) {
               acc_range.start = intron->range.start + j - 50;
               acc_underflow = false;
             }
@@ -1072,7 +1072,7 @@ static int find_false_sites(GtArray *false_don_0_gt, GtArray *false_don_0_gc,
             acc_range.end = intron->range.start + j + 51;
           }
           else {
-            if (intron->range.end >= j + 51) {
+            if (intron->range.end > j + 51) {
               acc_range.start = intron->range.end - j - 51;
               acc_underflow = false;
             }

@gordon do you have any comments or objections? I'm not too familiar with the intricacies of the code in this spot. If it's OK I can prepare a PR.

bioinformaticspcj commented 3 years ago

Many thanks for your debugging. It looks working well now. Looking forward to the new version of the program.

------------------ 原始邮件 ------------------ 发件人: "genometools/genomethreader" <notifications@github.com>; 发送时间: 2020年12月4日(星期五) 上午6:31 收件人: "genometools/genomethreader"<genomethreader@noreply.github.com>; 抄送: "Monkeyjun"<2532090267@qq.com>;"Author"<author@noreply.github.com>; 主题: Re: [genometools/genomethreader] How to merge mutiple GFF3 files? (#20)

I have taken some free time to dig into this today. I was able to reduce the test case to this minimal example: test.gff3.gz

Apparently, when calculating sequence ranges for extraction, we are passed a start coordinate of 0, which then, when subtracted the offset of 1, underflows later during coordinate munging before extraction in GenomeTools: [...] Program received signal SIGABRT, Aborted. GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50 50 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory. (gdb) bt #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50 #1 0x00007ffff7ca455b in GI_abort () at abort.c:79 #2 0x00005555555e09aa in gt_bioseq_get_sequence_range (bs=<optimized out>, idx=<optimized out>, start=start@entry=18446744073709551615, end=end@entry=100) at src/core/bioseq.c:388 #3 0x00005555555e226e in gt_bioseq_col_md5_to_seq (sc=<optimized out>, seq=0x7fffffffdaa0, start=18446744073709551615, end=100, md5_seqid=<optimized out>, err=0x5555556827b0) at src/core/bioseq_col.c:287 #4 0x0000555555599673 in gt_region_mapping_get_sequence (rm=0x555555687980, seq=0x7fffffffdaa0, seqid=0x555555688ee0, start=0, end=101, err=0x5555556827b0) at src/extended/region_mapping.c:240 #5 0x000055555556123e in find_false_sites (false_don_0_gt=0x55555569c8e0, false_don_0_gc=0x555555690b40, false_acc_0=0x55555569c910, false_don_1_gt=0x5555556940b0, false_don_1_gc=0x555555690b70, false_acc_1=0x5555556940e0, false_don_2_gt=0x555555694110, false_don_2_gc=0x555555690ba0, false_acc_2=0x555555690ab0, seqs=0x555555687bb0, proc_exons=true, region_mapping=0x555555687980, gcdonor=true, err=0x5555556827b0) at src/gth/bssm_seq_processor.c:1089 #6 0x000055555556198f in gth_bssm_seq_processor_find_false_sites (bsp=0x5555556879f0, region_mapping=0x555555687980, err=0x5555556827b0) at src/gth/bssm_seq_processor.c:1153 #7 0x000055555555ce4a in gt_gthbssmtrain_runner (argc=7, argv=0x7fffffffdde8, parsed_args=6, tool_arguments=0x555555684870, err=0x5555556827b0) at src/gth/gt_gthbssmtrain.c:319 #8 0x000055555558727b in gt_tool_run (tool=tool@entry=0x555555684810, argc=argc@entry=7, argv=argv@entry=0x7fffffffdde8, err=err@entry=0x5555556827b0) at src/core/tool.c:107 #9 0x0000555555587e0c in gt_toolobjdriver_with_license (tool_constructor=<optimized out>, version_func=0x55555555c22b <gthversionfunc>, argc=7, argv=0x7fffffffdde8, license_out=license_out@entry=0x0, license_constructor=license_constructor@entry=0x0, license_destructor=0x0) at src/core/tooldriver.c:96 #10 0x0000555555587f4f in gt_toolobjdriver (tool_constructor=<optimized out>, version_func=<optimized out>, argc=<optimized out>, argv=<optimized out>) at src/core/tooldriver.c:67 #11 0x000055555555c229 in main (argc=7, argv=0x7fffffffdde8) at src/gthbssmtrain.c:8 (gdb) f 5 #5 0x000055555556123e in find_false_sites (false_don_0_gt=0x55555569c8e0, false_don_0_gc=0x555555690b40, false_acc_0=0x55555569c910, false_don_1_gt=0x5555556940b0, false_don_1_gc=0x555555690b70, false_acc_1=0x5555556940e0, false_don_2_gt=0x555555694110, false_don_2_gc=0x555555690ba0, false_acc_2=0x555555690ab0, seqs=0x555555687bb0, proc_exons=true, region_mapping=0x555555687980, gcdonor=true, err=0x5555556827b0) at src/gth/bssm_seq_processor.c:1089 1089 had_err = gt_region_mapping_get_sequence(region_mapping, (gdb) p *intron $1 = {seqid = 0x555555688ee0, seq = 0x555555687f70, desc = 0x555555690a80, range = {start = 11, end = 77}, reverse = false, phase = 0} (gdb) p acc_range $2 = {start = 0, end = 101} (gdb) f 4 #4 0x0000555555599673 in gt_region_mapping_get_sequence (rm=0x555555687980, seq=0x7fffffffdaa0, seqid=0x555555688ee0, start=0, end=101, err=0x5555556827b0) at src/extended/region_mapping.c:240 240 had_err = gt_seq_col_md5_to_seq(rm->seq_col, seq, start - offset, (gdb) p offset $3 = 1 (gdb) p start $4 = 0 (gdb) p start-offset $5 = 18446744073709551615 (gdb) f 2 #2 0x00005555555e09aa in gt_bioseq_get_sequence_range (bs=<optimized out>, idx=<optimized out>, start=start@entry=18446744073709551615, end=end@entry=100) at src/core/bioseq.c:388 388 gt_assert(idx < gt_encseq_num_of_sequences(bs->encseq) && end >= start); (gdb) p start $6 = 18446744073709551615
It looks like these function expect GFF3-style coordinates, starting at 1. However, the 0 was calculated in src/gth/bssm_seq_processor.c:1067: if (!intron->reverse) { if (intron->range.start + j >= 50) { acc_range.start = intron->range.start + j - 50; acc_underflow = false; } else acc_underflow = true; acc_range.end = intron->range.start + j + 51; }

which -- here in the case of j = 39 (and intron->range.start = 11, see above) -- will lead to a acc_range.start of 0.

Simply replacing the >= with > fixes the problem and allows the training to complete on OP's whole data set. diff --git a/src/gth/bssm_seq_processor.c b/src/gth/bssm_seq_processor.c index 62bb279..55e6840 100644 --- a/src/gth/bssm_seq_processor.c +++ b/src/gth/bssm_seq_processor.c @@ -1021,7 +1021,7 @@ static int find_false_sites(GtArray false_don_0_gt, GtArray false_don_0_gc, (gcdonor && (cseq[j+1] == 'C' || cseq[j+1] == 'c')))) { if (!intron->reverse) { - if (intron->range.start + j >= 50) { + if (intron->range.start + j > 50) { don_range.start = intron->range.start + j - 50; don_underflow = false; } @@ -1030,7 +1030,7 @@ static int find_false_sites(GtArray false_don_0_gt, GtArray false_don_0_gc, don_range.end = intron->range.start + j + 51; } else { - if (intron->range.end >= j + 51) { + if (intron->range.end > j + 51) { don_range.start = intron->range.end - j - 51; don_underflow = false; } @@ -1063,7 +1063,7 @@ static int find_false_sites(GtArray false_don_0_gt, GtArray false_don_0_gc, (cseq[j] == 'A' || cseq[j] == 'a') && (cseq[j+1] == 'G' || cseq[j+1] == 'g')) { if (!intron->reverse) { - if (intron->range.start + j >= 50) { + if (intron->range.start + j > 50) { acc_range.start = intron->range.start + j - 50; acc_underflow = false; } @@ -1072,7 +1072,7 @@ static int find_false_sites(GtArray false_don_0_gt, GtArray false_don_0_gc, acc_range.end = intron->range.start + j + 51; } else { - if (intron->range.end >= j + 51) { + if (intron->range.end > j + 51) { acc_range.start = intron->range.end - j - 51; acc_underflow = false; }

@gordon do you have any comments or objections? I'm not too familiar with the intricacies of the code in this spot. If it's OK I can prepare a PR.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or unsubscribe.

satta commented 3 years ago

I'd leave the actual merge & release to someone with a better idea of how these changes might impact the output.

gordon commented 3 years ago

@satta: Many thanks for looking into this! Could you prepare a PR for your change? It looks good, but I want to test it on the old testsuite to see if it changes any previous results.

satta commented 3 years ago

@satta: Many thanks for looking into this! Could you prepare a PR for your change? It looks good, but I want to test it on the old testsuite to see if it changes any previous results.

Sure, see https://github.com/genometools/genomethreader/pull/23

gordon commented 3 years ago

Great, thanks. Testing...

bioinformaticspcj commented 3 years ago

Hi, has the test finished?