dputhier / pygtftk

A python package and a set of shell commands to handle GTF files
GNU General Public License v3.0

[peak_anno] Undefined chromosome error #66

Closed: Norisad closed this issue 5 years ago

Norisad commented 5 years ago

I ran into an issue when using the -z option; it returned this error message:

 |-- 14:07-ERROR-peak_anno : Chromosome  i from GTF is undefined in /gpfs/tagc/home/sadouni/tf/split_tf/EOMES.bed file.

Quentin

dputhier commented 5 years ago

Could you provide the other arguments used?

Norisad commented 5 years ago

After testing, -z may not be the problem. Reusing an actual GTF did not solve it. What worked was using an actual chromsize file instead of -c mm9.

dputhier commented 5 years ago

And when using -c mm9 and -K toto, could you paste the content of the chrom_size file that is produced?

dputhier commented 5 years ago

I just checked -c mm9 with an example and could not see any problem. I guess it is something more subtle.

dputhier commented 5 years ago

What about the format of

/gpfs/tagc/home/sadouni/tf/split_tf/EOMES.bed

The message seems rather explicit: it says there is a chromosome named i in EOMES.bed. I agree this is weird, but I would like to check. Could you provide the file as an attachment?

dputhier commented 5 years ago

And also the GTF used.

dputhier commented 5 years ago

Sorry. This "i" chromosome comes from a bug: the message printed the literal letter i instead of the value of the variable i.

The line

  message("Chromosome " + " i from GTF is undefined in --chrom-info file.",
                        type="ERROR")

should be

message("Chromosome " + i + " from GTF is undefined in --chrom-info file.", type="ERROR")

I fixed it in ba205ddecff8b59564a57dde63a5dc491c167675
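For illustration, here is a minimal sketch of this kind of check with the chromosome name interpolated into the message. The function name `undefined_chromosomes`, its arguments, and the sizes are hypothetical; this is not the actual pygtftk code.

```python
def undefined_chromosomes(gtf_chroms, chrom_info):
    """Return one error message per GTF chromosome missing from chrom_info.

    Hypothetical helper: gtf_chroms is an iterable of chromosome names,
    chrom_info maps chromosome name -> size (placeholder values below).
    """
    messages = []
    for i in gtf_chroms:
        if i not in chrom_info:
            # Concatenate the variable i, not the literal string " i".
            messages.append(
                "Chromosome " + i + " from GTF is undefined in --chrom-info file."
            )
    return messages


for msg in undefined_chromosomes(["chr1", "chr3_KI270779v1_alt"], {"chr1": 1000}):
    print(msg)
```

With the buggy version, every such message would name the chromosome "i", whatever the offending chromosome actually was.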

dputhier commented 5 years ago

Could you rerun and post the new error message, please? Best

qferre commented 5 years ago

I can reproduce the bug with my own files: I ran into this error again, independently, when trying to process random peaks generated from a human genome. I will use your fixed code and report back.

qferre commented 5 years ago

The fix in ba205ddecff8b59564a57dde63a5dc491c167675 was incomplete. I completed the fix in 4b6d6eaf5afed42055bd5af92c3c8e12a973cb61 on another line.

qferre commented 5 years ago

In my own test files, the problematic chromosome was "chr3_KI270779v1_alt". I assume the chromsizes you download by default do not include the "alt" ones?

dputhier commented 5 years ago

Maybe we could add an argument so that the program can continue when a chromosome from the bed file is not defined in the genome.

dputhier commented 5 years ago

I mean from the GTF file...

qferre commented 5 years ago

One possibility is to remove all lines whose first field (the chromosome) is not among the known chromosomes. Can this be done easily in pybedtools?
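Independently of pybedtools, the idea can be sketched in plain Python. This is only an illustration: `filter_bed_lines` is a hypothetical helper, and it assumes tab-separated BED lines and a set of known chromosome names taken from the chromsize file.

```python
def filter_bed_lines(bed_lines, known_chroms):
    """Keep only the BED lines whose first column is a known chromosome.

    Hypothetical helper: bed_lines is an iterable of tab-separated strings,
    known_chroms is a set of chromosome names.
    """
    kept = []
    for line in bed_lines:
        # The chromosome is the first tab-separated field of a BED line.
        chrom = line.rstrip("\n").split("\t")[0]
        if chrom in known_chroms:
            kept.append(line)
    return kept


peaks = [
    "chr1\t100\t200\n",
    "chr3_KI270779v1_alt\t10\t50\n",
    "chr2\t5\t25\n",
]
for line in filter_bed_lines(peaks, {"chr1", "chr2"}):
    print(line, end="")
```

In pybedtools, calling `BedTool.filter` with a predicate on `interval.chrom` should give an equivalent result, though I have not checked it against these files.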

dputhier commented 5 years ago

OK. We fixed the 'i' that was written instead of the corresponding chromosome name. Moreover, I added three new arguments to silently delete features located on chromosomes that are absent from the --chrom-info file and found in (i) GTF files, (ii) peak files and (iii) more-bed files.

 -f, --force-chrom-gtf       Discard silently from GTF genes outside chromosomes defined in --chrom-info. (default: False)
 -w, --force-chrom-peak      Discard silently from --peak-file peaks outside chromosomes defined in --chrom-info. (default: False)
 -q, --force-chrom-more-bed  Discard silently from --more-bed files regions outside chromosomes defined in --chrom-info. (default: False)

This is implemented in 4356559f7f3bc95ad8e10df4f6112dca47f0b03a
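As a rough sketch of how three boolean flags like these could be declared with argparse (an assumption for illustration, not the code from that commit; the parser name `peak_anno_sketch` and the function `build_parser` are hypothetical):

```python
import argparse


def build_parser():
    """Build a toy parser exposing the three --force-chrom-* flags."""
    parser = argparse.ArgumentParser(prog="peak_anno_sketch")
    parser.add_argument(
        "-f", "--force-chrom-gtf", action="store_true",
        help="Discard silently from GTF genes outside chromosomes "
             "defined in --chrom-info.")
    parser.add_argument(
        "-w", "--force-chrom-peak", action="store_true",
        help="Discard silently from --peak-file peaks outside chromosomes "
             "defined in --chrom-info.")
    parser.add_argument(
        "-q", "--force-chrom-more-bed", action="store_true",
        help="Discard silently from --more-bed files regions outside "
             "chromosomes defined in --chrom-info.")
    return parser


# Each flag defaults to False and becomes True only when passed.
args = build_parser().parse_args(["-w"])
print(args.force_chrom_gtf, args.force_chrom_peak, args.force_chrom_more_bed)
```

Keeping one flag per input type lets a user drop unknown chromosomes from the peaks, for instance, while still getting an error if the GTF itself contains an unexpected chromosome.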