elzbth / jitterbug

Jitterbug is a bioinformatics tool that predicts insertion sites of transposable elements in a sample sequenced with short paired-end reads, with respect to an assembled reference.

Jitterbug does not recognize transposon annotation in either .gff or .fasta format #4

Closed padmoo closed 8 years ago

padmoo commented 8 years ago

Hi everyone,

I'm a bioinformatics newbie and I'm trying to use jitterbug. I'm running the following command:

```
python jitterbug.py LIB17997_good_align_sorted.bam Thaps3_chromosomes_geneModels_FilteredModels2.gff
```

and I get the following error:

```
must supply either a TE annotation in .gff or .bed format (-t) OR a list of reference TE sequences in fasta format (-T)
this program identifies putative TE insertion sites in a sequenced sample with respect to a reference genome sequence
```

my gff file looks like this:

```
chr_1  JGI  start_codon  10354  10356  .  -  0  name "fgenesh1_pg.C_chr_1000001"
chr_1  JGI  exon         17783  17942  .  -  .  name "fgenesh1_pg.C_chr_1000003"; transcriptId 869
chr_1  JGI  CDS          17783  17942  .  -  0  name "fgenesh1_pg.C_chr_1000003"; proteinId 869; exonNumber 50
chr_1  JGI  exon         17986  18391  .  -  .  name "fgenesh1_pg.C_chr_1000003"; transcriptId 869
chr_1  JGI  CDS          17986  18391  .  -  1  name "fgenesh1_pg.C_chr_1000003"; proteinId 869; exonNumber 49
```

my fasta file looks like this:

```
>Chromosomefgenesh1_pg.C_chr_1000001_Tp_Ambal2:classI:LINE:Ambal_chr_1_11992425-
TCCGGCAACTGAAGGCACTCGCAGTACTGATTGATAATGCATTTATACCAAAATTGGGTA
TGAGTAGCAAAATGAAGAGGATCGCAGTGTATGCACCGCTTGAACTGGGAGGAGCGAATT
TCCCGAGTATCGAAAGTCTCCAGGATCAGATGGGAATTGACCACTTTGTTCGCTCAGTTC
AATGGGGGAAAGAGCTGGCCACTGACATCCGGATTGTACTGTCTCGAGTACAACTTTACT
```

Does anybody have an idea what I'm doing wrong?

Thank you!

mbosio85 commented 8 years ago

Hi,

I don't know for certain what is going wrong, but let's try to find it. Could you check that you are using the current version of jitterbug? I can't find the error you report in the code, so maybe it is an older version, or it is related to external utils.

If the error keeps coming up, could you please post the full output from jitterbug, from when you launch it until it dies? It is better if it is within a code box,

like this

Anyway, looking at your file, the GFF format looks slightly different from the one used in the data folder:

TAIR_ID=AT1TE00010;Name=AT1TE00010;Family=ATCOPIA24;ID=Ath_TE_3_ATCOPIA24

Could you try changing your format to match the proposed one, with Name=XXX (a unique value)?
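If it helps, here is a minimal sketch (not part of Jitterbug, and assuming your annotation is tab-delimited as GFF normally is; the file names are placeholders) of how the JGI-style attribute column could be rewritten into a GFF3-style Name= attribute:

```python
# Minimal sketch: turn a JGI-style attribute column such as
#   name "fgenesh1_pg.C_chr_1000001"
# into a GFF3-style column such as
#   Name=fgenesh1_pg.C_chr_1000001
# (other attributes like transcriptId are simply dropped in this sketch)
import re

with open('Thaps3_TEs.gff') as infile, open('TE_gff.gff3', 'w') as outfile:
    for line in infile:
        fields = line.rstrip('\n').split('\t')
        if line.startswith('#') or len(fields) < 9:
            outfile.write(line)   # pass comments and odd lines through unchanged
            continue
        match = re.search(r'name\s+"([^"]+)"', fields[8])
        if match:
            fields[8] = 'Name=' + match.group(1)
        outfile.write('\t'.join(fields) + '\n')
```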

Thanks

Mattia

padmoo commented 8 years ago

Hi Mattia,

thank you for helping me with the problem.

I've downloaded the new version of jitterbug and it seems to be running further than before (it is still running).

I saw in the README file that I need a gff3, is that correct?

I've formatted the gff to 9 columns now and it looks like this:

```
chr_1  JGI  gene  300  10356  .  -  .  Name=867
```

Is this sufficient?

I have a new error when I use the above formatted gff3 file:

```
python jitterbug.py LIB17997_good_align_sorted.bam TE_gff.gff3
processing LIB17997_good_align_sorted.bam
starting at 2016-07-07 21:29:36.237351
calculating mean insert size...
blah
mean fragment length over 1000000 reads: 444.00
standard deviation of fragment_length: 115.90
mean read length: 99.54
standard deviation of read length: 5.52
WARNING: fragment length standard deviation seems way too large to be realistic.\n There is maybe something weird with the flags in your bam mapping, or a very large number of large SV \n that are messing up the count.\n Setting the stdev to 0.1*fragment_length = 44.40 for downstream calculations
elapsed time: 0:00:07.136919
selecting discordant reads...
done selecting discordant reads.
elapsed time: 0:11:59.032886
selecting discordant read pairs where exactly one maps to a TE...

jitterbugdbfile.sqlite

Traceback (most recent call last):
  File "jitterbug.py", line 129, in <module>
    main(sys.argv[1:])
  File "jitterbug.py", line 125, in main
    args.numCPUs, args.bin_size, args.minMAPQ, generate_test_bam, args.pre_filter, args.conf_lib_stats, mem, args.min_cluster_size)
  File "/Users/Katrin/Desktop/UEA/Evolution experiment/Transposons/jitterbug-master/Run_TE_ID_reseq.py", line 198, in run_jitterbug
    read_pair_one_overlap_TE_list = discordant_bam_reader.select_read_pair_one_overlap_TE_annot(te_annot, interval_size, min_mapq, database_file, bin_size)
  File "/Users/Katrin/Desktop/UEA/Evolution experiment/Transposons/jitterbug-master/BamReader.py", line 167, in select_read_pair_one_overlap_TE_annot
    read_pair.TE_annot_attr_list.extend([ gff_interval.attrs for gff_interval in overlapping_TE_annots ])
  File "pybedtools/cbedtools.pyx", line 390, in pybedtools.cbedtools.Interval.attrs.__get__ (pybedtools/cbedtools.cxx:5769)
  File "pybedtools/cbedtools.pyx", line 180, in pybedtools.cbedtools.Attributes.__init__ (pybedtools/cbedtools.cxx:2312)
ValueError: need more than 1 value to unpack
```

I'm guessing there might be something wrong with the pybedtools installation?

Thanks, Katrin

mbosio85 commented 8 years ago

Hi Katrin,

The line returning the error is line 167 in BamReader.py: `read_pair.TE_annot_attr_list.extend([ gff_interval.attrs for gff_interval in overlapping_TE_annots ])`

I found this thread about pybedtools returning the same error, and it looks like it's a formatting issue: https://groups.google.com/forum/#!topic/bedtools-discuss/N_--hAVeB1o
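My reading of the error (a guess from the message, not taken from the pybedtools source) is that `Interval.attrs` parses the 9th GFF column as semicolon-separated `key=value` pairs, so any attribute field without an `=` cannot be unpacked into a key and a value. A tiny self-contained illustration of that failure mode, in plain Python:

```python
# Hypothetical illustration (not pybedtools itself) of why an attribute field
# without '=' triggers "ValueError: need more than 1 value to unpack" in Python 2.
attr_string = "Name=867;some_field_without_equals"

for field in attr_string.split(';'):
    key, value = field.split('=')   # fails on the second field: only one value to unpack
    print(key + ' = ' + value)
```

So it is worth checking that every record in your gff3 has only `key=value` attributes in column 9.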

Could you check the pybedtools version on their GitHub, in case it needs updating? On my machine I have pybedtools 0.6.6.

I tested a small code block that you can replicate to verify whether you need to change something in your gff file or in pybedtools to make it work. The idea is to replicate just the bare minimum lines that generate the error you receive.

For one of the provided gff files:

```python
import pybedtools

TE_annot = 'TAIR10_Henaff2014PlantJ_annot.gff3'
TE_annot_intervals = pybedtools.IntervalFile(TE_annot)
map_interval = pybedtools.Interval('chr1', 1, 1000000, strand='+')  # here you should put chr_1, not 'chr1', for your gff file
overlapping_TE_annots = TE_annot_intervals.all_hits(map_interval)
problematic_list = list()
if len(overlapping_TE_annots) > 0:
    problematic_list.extend([ gff_interval.attrs for gff_interval in overlapping_TE_annots ])
print problematic_list
```

With the provided gff file you get an output like this:

```
[{'ID': 'Ath_TE_3_ATCOPIA24', 'Name': 'AT1TE00010', 'Family': 'ATCOPIA24', ...
```

I tested it with the gff line you wrote above and it worked fine:

```
print problematic_list
[{'Name': '867'}]
```

Let me know how it goes

Mattia

padmoo commented 8 years ago

Hi Mattia,

my pybedtools version is pybedtools 0.7.7

I tried the code block you provided, both with the provided gff3 and my gff3 (changed the chr_ to chr), and I do not get the output you showed. I get the following message:

```
import pybedtools
TE_annot='TAIR10_Henaff2014PlantJ_annot.gff3'
TE_annot_intervals = pybedtools.IntervalFile(TE_annot)
map_interval = pybedtools.Interval('chr1',1, 1000000, strand='+')
#here you should put chr_1 not 'chr1 for your gff file
overlapping_TE_annots = TE_annot_intervals.all_hits(map_interval)
problematic_list = list()
if len(overlapping_TE_annots) > 0:
problematic_list.extend([ gff_interval.attrs for gff_interval in overlapping_TE_annots ])
  File "<stdin>", line 2
    problematic_list.extend([ gff_interval.attrs for gff_interval in overlapping_TE_annots ])
    ^
IndentationError: expected an indented block
print problematic_list
[]
```

My gff3 brings up the same error. Google tells me that there is a problem with tabs and spaces? But looking at the code I cannot see where the problem is.

My gff3 looks like this now:

```
chr1  JGI  gene  300  10356  .  -  .  Name=867;ID=fgenesh1_pg.C_chr1000001;Tp_Ambal2:classI:LINE:Ambal
```

My Python version is 2.7.10.

Thank you!

Katrin

mbosio85 commented 8 years ago

Hi Katrin, this error is quickly fixed: just add spacing (indentation) before `problematic_list.extend([ gff_interval.attrs for gff_interval in overlapping_TE_annots ])`. In my original snippet I put '...' in front of that line to stand in for the indentation, because the comment formatting removes spaces at the beginning of a line (see the sketch below). But seeing your package and Python versions, I still do not understand why it's not working :S Let me know the output.

Mattia
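To make the indentation explicit, here is a tiny self-contained version of the last step. The list content is just a stand-in so it runs on its own; in the real session `overlapping_TE_annots` comes from pybedtools:

```python
# The extend() call must be indented under the `if`; the '...' in the earlier
# snippet only stood in for that leading whitespace.
overlapping_TE_annots = ['placeholder_interval']   # stand-in value so this runs on its own
problematic_list = list()
if len(overlapping_TE_annots) > 0:
    problematic_list.extend([item for item in overlapping_TE_annots])
print(problematic_list)
```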

padmoo commented 8 years ago

Hi Mattia,

I've added 3 spaces to the last two lines. I'm getting the error again:

```
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "cbedtools.pyx", line 338, in pybedtools.cbedtools.Interval.attrs.__get__ (pybedtools/cbedtools.cpp:4377)
  File "cbedtools.pyx", line 139, in pybedtools.cbedtools.Attributes.__init__ (pybedtools/cbedtools.cpp:1630)
ValueError: need more than 1 value to unpack
```

I've uninstalled my pybedtools version and installed yours (I got a lot of warnings and had to install Cython too).

According to the link you posted, there is an issue with the = and semicolons? I don't understand why it works for you but not for me :( Could there be general issues with the pybedtools installation? I tried installing with pip install pybedtools at the beginning but it wouldn't do it because of denied access. Then I did sudo pip install pybedtools, which seemed to have worked (it looked like it to me). Now I've installed the 0.6.6 version with sudo easy_install pybedtools==0.6.6. When doing the sudo pip install I get a yellow message that I might want the -H flag. Does that give you more insight into what could be wrong?

Thanks a lot for taking the time to help me!

Katrin

mbosio85 commented 8 years ago

Hmm, I have never seen the pip -H flag actually.

This might help with cleaning up multiple versions of the same package: http://stackoverflow.com/questions/5226311/installing-specific-package-versions-with-pip

Frankly, it's the first time I've seen this error; I hope it gets resolved.

Just in case, please post the pip -H message here. If it's not resolved by Monday, we can try to replicate your full environment here and work backwards to make it work.

Mattia

padmoo commented 8 years ago

I've uninstalled pybedtools version 0.7.7. When installing version 0.6.6 with sudo I get:

```
The directory '/Users/Katrin/Library/Caches/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
The directory '/Users/Katrin/Library/Caches/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Requirement already satisfied (use --upgrade to upgrade): pybedtools==0.6.6 in /Library/Python/2.7/site-packages/pybedtools-0.6.6-py2.7-macosx-10.11-intel.egg
```

Katrin

mbosio85 commented 8 years ago

Ok, so -H is for sudo. From sudo's help:

```
-H   set HOME variable to target user's home dir.
```

Basically, I think it will install it in your home directory. It looks like you do not have the right permissions on pip's cache folder when running it with sudo.

Anyway, it also says:

```
Requirement already satisfied (use --upgrade to upgrade): pybedtools==0.6.6 in /Library/Python/2.7/site-packages/pybedtools-0.6.6-py2.7-macosx-10.11-intel.egg
```

So pybedtools 0.6.6 should be installed. Can you check with Python and see if the code runs?

Mattia

padmoo commented 8 years ago

If you mean the following by checking if the command runs:

```
Katrins-MacBook-Pro:jitterbug-master Katrin$ pybedtools
pybedtools utility scripts:

    annotate            : annotate a file with the neearest features in another.
    intersection_matrix : Creates a pairwise matrix containing overlapping feature counts for many BED files
    intron_exon_reads   : Third quick example from the documentation -- count reads introns and exons, in parallel
    peak_pie            : Make a pie chart of features overlapping annotations (e.g., peaks in introns, exons, etc)
    py_ms_example       : Runs Python example from the manuscript
    pybedtools_demo     : Quick demo of some pybedtools functionality
    venn_gchart         : Create a 3-way Venn diagram using Google Charts API
    venn_mpl            : Create a 3-way Venn diagram, using matplotlib
```

Then, yes it does.

mbosio85 commented 8 years ago

Cool,

Now, if you try the code lines that were giving the error before with the gff file, is it still crashing?

padmoo commented 8 years ago

Yes, it does.

BUT with your provided gff3 file I get the following output:

```
Python 2.7.10 (default, Oct 23 2015, 19:19:21)
[GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.59.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pybedtools
>>> TE_annot='TAIR10_Henaff2014PlantJ_annot.gff3'
>>> TE_annot_intervals = pybedtools.IntervalFile(TE_annot)
>>> map_interval = pybedtools.Interval('chr1',1, 1000000, strand='+')
>>> #here you should put chr_1 not 'chr1 for your gff file
... overlapping_TE_annots = TE_annot_intervals.all_hits(map_interval)
>>> problematic_list = list()
>>> if len(overlapping_TE_annots) > 0:
...     problematic_list.extend([ gff_interval.attrs for gff_interval in overlapping_TE_annots ])
...
>>> print problematic_list
[{'ID': 'Ath_TE_3_ATCOPIA24', 'Name': 'AT1TE00010', 'Family': 'ATCOPIA24', 'TAIR_ID': 'AT1TE00010'}, {'ID': 'Ath_TE_4_ATREP4', 'Name': 'AT1TE00020', 'Family': 'ATREP4', 'TAIR_ID': 'AT1TE00020'}, {'ID': 'Ath_TE_5_ATREP3', 'Name': 'AT1TE00025', 'Family': 'ATREP3', 'TAIR_ID': 'AT1TE00025'}, {'ID': 'Ath_TE_6_TA11', 'Name': 'AT1TE00220', 'Family': 'TA11', 'TAIR_ID': 'AT1TE00220'}, {'ID': 'Ath_TE_7_HELITRONY1D', 'Name': 'AT1TE00225', 'Family': 'HELITRONY1D', 'TAIR_ID': 'AT1TE00225'}, {'ID': 'Ath_TE_8_ATLINE1_3A', 'Name': 'AT1TE00470', 'Family': 'ATLINE1_3A', 'TAIR_ID': 'AT1TE00470'}, {'ID': 'Ath_TE_9_ATREP5', 'Name': 'AT1TE00600', 'Family': 'ATREP5', 'TAIR_ID': 'AT1TE00600'}]
```

:) :) :)

So, the format in my gff file is still wrong?

mbosio85 commented 8 years ago

Just to be sure, because it did run when I tried:

padmoo commented 8 years ago

Hi Mattia,

it is working now. I've changed the : into _ as well and added a type=. The first line of my gff3 looks like this now:

```
chr1  JGI  gene  300  10356  .  -  .  Name=867;ID=fgenesh1_pg.C_chr1000001;type=Tp_Ambal2_classI_LINE_Ambal
```

And I get the output:

```
[{'type': 'Tp_Ambal2_classI_LINE_Ambal', 'Name': '867', 'ID': 'fgenesh1_pg.C_chr1000001'}]
```

So it seems to be fine now.

I'm gonna try running the whole script now.

mbosio85 commented 8 years ago

Nice :) I hope Jitterbug will now run with your data. Otherwise, let me know and we'll go through whatever pops up.

padmoo commented 8 years ago

It seems to run without errors now, but there are other problems:

```
Katrins-MacBook-Pro:jitterbug-master Katrin$ python jitterbug.py LIB17997_good_align_sorted.bam TE_gff.gff3
processing LIB17997_good_align_sorted.bam
starting at 2016-07-08 17:34:05.480262
calculating mean insert size...
blah
mean fragment length over 1000000 reads: 444.00
standard deviation of fragment_length: 115.90
mean read length: 99.54
standard deviation of read length: 5.52
WARNING: fragment length standard deviation seems way too large to be realistic.\n There is maybe something weird with the flags in your bam mapping, or a very large number of large SV \n that are messing up the count.\n Setting the stdev to 0.1*fragment_length = 44.40 for downstream calculations
elapsed time: 0:00:07.066989
selecting discordant reads...
done selecting discordant reads.
elapsed time: 0:11:52.239277
selecting discordant read pairs where exactly one maps to a TE...

jitterbugdbfile.sqlite

[]
number discordant read pairs with exactly one read overlapping a TE: 0
elapsed time: 0:34:03.401173
generating clusters...
Generating clusters for each bin
**total fwd single clusters found: 0
**total rev single clusters found: 0
**total cluster pairs found: 0
elapsed time: 0:34:03.415165
writing clustered reads to bam file, writing to gff and tables...
0
elapsed time: 0:34:03.415714
done! at 2016-07-08 18:08:08.896083
```

I've already started redoing the bam file, but I'm not sure what the issue with the standard deviation means. Anyway, I don't think it is a problem with the tool.

Thanks a lot for getting me here!!!

Katrin

padmoo commented 8 years ago

Oh my, I have a new error:

```
Katrins-MacBook-Pro:jitterbug-master Katrin$ python jitterbug.py LIB17997_good_align_sorted.bam TE_gff.gff3
processing LIB17997_good_align_sorted.bam
starting at 2016-07-08 22:06:04.873587
calculating mean insert size...
blah
mean fragment length over 1000000 reads: 444.00
standard deviation of fragment_length: 115.90
mean read length: 99.54
standard deviation of read length: 5.52
WARNING: fragment length standard deviation seems way too large to be realistic.\n There is maybe something weird with the flags in your bam mapping, or a very large number of large SV \n that are messing up the count.\n Setting the stdev to 0.1*fragment_length = 44.40 for downstream calculations
elapsed time: 0:00:07.182232
selecting discordant reads...
done selecting discordant reads.
elapsed time: 0:12:43.752674
selecting discordant read pairs where exactly one maps to a TE...

jitterbugdbfile.sqlite

Traceback (most recent call last):
  File "jitterbug.py", line 129, in <module>
    main(sys.argv[1:])
  File "jitterbug.py", line 125, in main
    args.numCPUs, args.bin_size, args.minMAPQ, generate_test_bam, args.pre_filter, args.conf_lib_stats, mem, args.min_cluster_size)
  File "/Users/Katrin/Desktop/UEA/Evolution experiment/Transposons/jitterbug-master/Run_TE_ID_reseq.py", line 198, in run_jitterbug
    read_pair_one_overlap_TE_list = discordant_bam_reader.select_read_pair_one_overlap_TE_annot(te_annot, interval_size, min_mapq, database_file, bin_size)
  File "/Users/Katrin/Desktop/UEA/Evolution experiment/Transposons/jitterbug-master/BamReader.py", line 207, in select_read_pair_one_overlap_TE_annot
    tabname="[c%s%d%d%s]"%(read_pair.interval_chr, bin_size*(int(read_pair.interval_start)/bin_size), bin_size*(1+int(read_pair.interval_start)/bin_size), read_pair.interval_direction)
ZeroDivisionError: integer division or modulo by zero
```

mbosio85 commented 8 years ago

Hi,

It should be a quick fix: the code divides by bin_size, and the default value is 0. So try this:

```
python jitterbug.py LIB17997_good_align_sorted.bam TE_gff.gff3 --bin_size 50000000
```

If it runs, try to update the code and run it again without the --bin_size (I just updated it and hope it works).
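For context, here is a minimal sketch of the expression from your traceback (rewritten by hand, assuming the `*` characters were eaten by the Markdown formatting): with the old default bin_size of 0 the integer division blows up, while a positive bin size maps each read pair to the boundaries of its bin.

```python
# Hand-rewritten from line 207 of BamReader.py in the traceback above, with
# placeholder values for the read-pair fields; Python 2 integer division assumed.
bin_size = 50000000            # value passed via --bin_size; the old default of 0 raises ZeroDivisionError
interval_chr = 'chr1'          # hypothetical read-pair interval
interval_start = 123456789
interval_direction = 'fwd'

tabname = "[c%s%d%d%s]" % (interval_chr,
                           bin_size * (int(interval_start) / bin_size),       # start of the containing bin
                           bin_size * (1 + int(interval_start) / bin_size),   # end of the containing bin
                           interval_direction)
print(tabname)   # -> [cchr1100000000150000000fwd]
```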

Btw: looking at your first output:

```
Generating clusters for each bin
**total fwd single clusters found: 0
**total rev single clusters found: 0
**total cluster pairs found: 0
```

I am concerned that you don't get many results, since no clusters were found (but maybe that was just a toy example you posted).

Mattia

padmoo commented 8 years ago

Hi Mattia,

I've run it with the changed bin size and it seems to work:

```
******************total fwd single clusters found: 284
******************total rev single clusters found: 267
******************total cluster pairs found: 35
elapsed time: 1:45:25.788983
writing clustered reads to bam file, writing to gff and tables...
26
elapsed time: 1:45:25.818936
done! at 2016-07-18 18:07:58.481180
```

and the files and tables look scary but interesting!

I'm gonna try the cancer pipeline next as I have two samples I would like to compare.

What does the bin size refer to, and why did you choose 50000000?

mbosio85 commented 8 years ago

Hi,

binsize is the size in bp that we use to split each chromosome; we then count the read pairs with the proper features in each bin. We use it so we can parallelize the processing across multiple threads and calculate the overlaps for each bin separately. We chose 500... so that it's not too big or too small, to somehow "optimize" the runtime performance. The final result is the same with any binsize [except for that weird case where CPU=1 and binsize=0 .. I'll work on fixing it :) ]
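If it helps to picture it, here is a minimal sketch (not Jitterbug's actual code, just the idea described above) of how a chromosome gets split into fixed-size bins that can then be processed in parallel; the chromosome length is made up:

```python
# Illustrative only: split a chromosome into bins of bin_size bp, so each bin
# can be handed to a separate worker.
bin_size = 50000000   # 50 Mb, the value suggested above

def bins_for_chromosome(chrom_length, bin_size):
    """Yield (start, end) half-open intervals covering the chromosome."""
    for start in range(0, chrom_length, bin_size):
        yield (start, min(start + bin_size, chrom_length))

# A hypothetical 120 Mb chromosome falls into three bins:
print(list(bins_for_chromosome(120000000, bin_size)))
# [(0, 50000000), (50000000, 100000000), (100000000, 120000000)]
```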

Hope this resolves your doubts.

Cheers, Mattia

padmoo commented 8 years ago

Hi Mattia,

thanks for the explanation! I'm pretty new to all this and it helps my understanding.

Regards, Katrin