daler / metaseq

Framework for integrated analysis and plotting of ChIP/RIP/RNA/*-seq data
https://daler.github.io/metaseq
MIT License
87 stars 36 forks source link

genomic_signal #29

Open kellermac opened 8 years ago

kellermac commented 8 years ago

Hello again. I am making an average plot of Bigwig (signal) files and a gff3 reference file (midpoint of binding sites). I have succesfully used this script before, but when I substitute this new gff3 file, I receive the following error message: $ python hts.py Traceback (most recent call last): File "hts.py", line 49, in processes=processes) File "/usr/local/lib/python2.7/dist-packages/metaseq/_genomic_signal.py", line 122, in array chunksize=chunksize, **kwargs) File "/usr/local/lib/python2.7/dist-packages/metaseq/array_helpers.py", line 371, in _array_parallel chunks = list(chunker(genelist, chunksize)) File "/usr/local/lib/python2.7/dist-packages/metaseq/helpers.py", line 27, in chunker x.append(f.next()) File "pybedtools/cbedtools.pyx", line 787, in pybedtools.cbedtools.IntervalIterator.next (pybedtools/cbedtools.cxx:11136) File "pybedtools/cbedtools.pyx", line 697, in pybedtools.cbedtools.create_interval_from_list (pybedtools/cbedtools.cxx:9933) pybedtools.cbedtools.MalformedBedLineError: Start is greater than stop

The final line "MalformedBedLineError: Start is greater than stop". Indicated to me that my Gff3 reference file was malformed. So I verified the start/ stop positions of that file and found them to be correct (ie start is less than stop). Any advice would be greatly appreciated. Thanks for all the help!

daler commented 8 years ago

Tough to figure anything out without the input files. Can you provide an example of a gff file that causes this error?

kellermac commented 8 years ago

shortcis.txt

daler commented 8 years ago

On this file, I'm able to do this without errors:

import pybedtools
for i in pybedtools.BedTool('shortcis.txt'):
    print(i.chrom, i.start, i.stop)

Are you able to do the same on the full file?

kellermac commented 8 years ago

Yes, here is the first line of the output: (u'chr2R', 702320, 702325)

not sure what the significance of the 'u' is. I'll attach the full gff3. htscis.txt

daler commented 8 years ago

Hmm, seems fine. The u is just Python 2 telling you it's unicode.

If you convert to BED do you get the same error? It could be the lack of proper GFF/GTF formatting (e.g., no actual attributes in the attributes field) that's causing the problem.

To convert:

awk -F "\t" '{OFS="\t"; print $1, $4-1, $5}' htscis.txt > htscis.bed
kellermac commented 8 years ago

Can gffutils make a database from BED files? I have used the same formatting before (with blank fields). But thank you for the sanity check.

daler commented 8 years ago

Nope, it can't make a db from bed files.

Would you send a minimal complete example of the data and code that reproduces the problem? I'm going to need to look more closely at it for debugging.

kellermac commented 8 years ago

python script: htshort.txt

signal file: (I was using Bigwigs but switched to gff3 in order to send it on this message board.

h3k27ac.txt

Window file: I have sent previously as htscis.txt (it is gff3)

daler commented 8 years ago

Are you sure you can reproduce the error with these files? The h3k27ac.txt one doesn't have the same chromosome names as htscis.txt, and there's a lot of extra stuff in htshort.txt, including paths to directories and a database that I'm not sure are relevant, that prevents me from running it.

kellermac commented 8 years ago

Ah I am sorry. Oddly enough it did reproduce the same error message. The chromosome name will definately cause a subsequent error. As I am sure you know the fly community is inconsistent about whether or not to include a 'chr' prefix. The htscis file is cis interaction regions so they are all on 2R (will the absense of the other chromosomes affect compatability?). I removed the chr prefix: htscis.txt

I left the paths in there just to make it obvious that this is where paths should go. I used the db_dir and data_dir conventions from your example 1. You could replace both of these paths with your download directory. (the database was made from the htscis.gff3 file using gffutils) Here is the script I use to make databases: db.txt