erhard-lab / price

Improved Ribo-seq enables identification of cryptic translation events
10 stars 0 forks source link

invalid reference index -1 #9

Closed m-swirski closed 5 years ago

m-swirski commented 6 years ago

Hi all, I get following error while executing Bam2CIT function (the same applies for gedi -e Price), while working on sorted, indexed BAM file that is otherwise just fine for any other task.

2018-09-04 18:14:17.096 INFO Command: gedi -e Bam2CIT -p riboseq.cit kmarx_riboseqAligned.sortedByCoord.out.bam 2018-09-04 18:14:17.232 INFO Discovering classes in classpath 2018-09-04 18:14:17.316 INFO Preparing simple class references 2018-09-04 18:14:17.422 INFO Gedi 1.0.2 (JAR) startup \ IV-:1188365-1188395 [1] x1 192813 (56246.8/sec)Exception in thread "main" java.lang.IllegalArgumentException: Invalid reference index -1 at htsjdk.samtools.QueryInterval.(QueryInterval.java:24) at htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter.query(SamReader.java:504) at gedi.region.bam.BamGenomicRegionStorage$BufferingGenomicRegionSpliterator.iterator(BamGenomicRegionStorage.java:992) at gedi.region.bam.BamGenomicRegionStorage$BufferingGenomicRegionSpliterator.tryAdvance(BamGenomicRegionStorage.java:1037) at gedi.util.functions.MappedSpliterator.tryAdvance(MappedSpliterator.java:54) at gedi.util.functions.SpliteratorArraySpliterator.tryAdvance(SpliteratorArraySpliterator.java:62) at gedi.util.functions.MappedSpliterator.tryAdvance(MappedSpliterator.java:54) at java.util.Spliterators$1Adapter.hasNext(Spliterators.java:681) at gedi.util.functions.EI$4.hasNext(EI.java:141) at gedi.util.FunctorUtils$InitActionIterator.hasNext(FunctorUtils.java:679) at gedi.util.FunctorUtils$PeekIterator.hasNext(FunctorUtils.java:636) at gedi.util.FunctorUtils$SideEffectIterator.hasNext(FunctorUtils.java:717) at gedi.util.FunctorUtils$HasNextActionIterator.hasNext(FunctorUtils.java:807) at gedi.centeredDiskIntervalTree.CenteredDiskIntervalTreeStorage.fill(CenteredDiskIntervalTreeStorage.java:309) at gedi.core.region.GenomicRegionStorage.fill(GenomicRegionStorage.java:183) at executables.Bam2CIT.main(Bam2CIT.java:137) 2018-09-04 18:14:20.973 INFO Finished: gedi -e Bam2CIT -p riboseq.cit kmarx_riboseqAligned.sortedByCoord.out.bam

What might be the problem and how should I approach this? I have successfully worked with price on different organism, so installation, etc. is proper.

Best regards, Michał

florianerhard commented 6 years ago

Dear Michal,

that is a strange error... I would say this due to some corrupted bam file, but if this works otherwise...

Could you do two things: First enter

gedi Nashorn -e "println(load('xyz.bam').getReferenceSequences())"

This should output all chromosomes in the index of the bam file. Then, please enter

gedi Nashorn -e "load('xyz.bam').ei('IV-').print()"

and test a few of the chromosomes instead of IV. My guess is that something will be wrong with either IV or V (based on the progress indicator). If there is no error, you could also try

gedi Nashorn -e "load('xyz.bam').ei().print()"

Florian

m-swirski commented 6 years ago

Thanks for help, It appears that reads aligning to Mitochondrial sequence cause the problem, but I dont see a reason why would it be so. For the time being I can just remove chrM, but what may cause the problem?

Michał

boboppie commented 6 years ago

Dear Florian,

I also got a similar error when testing my data -

PRICE 1.0.2 Genome GRCm38

The error message:

2018-09-05 17:02:15.782 INFO Estimating maxpos...
2018-09-05 17:02:15.787 INFO Clustering reads
- Clustering reads 7364957 (66361.7/sec)
\ Identify maxpos 7469548 (67562.3/sec)
\ Collecting statistics 18-:88894181-88894212                       3610954 (33032.6/sec)PRICE
PRICE is an analysis method for Ribo-seq data.

Error:
java.lang.RuntimeException: Could not run gedi.riboseq.javapipeline.PriceCollectSufficientStatistics@9d2be09
Could not run gedi.riboseq.javapipeline.PriceCollectSufficientStatistics@9d2be09
Invalid reference index -1

gedi -e PRICE <Options>

General:
 -reads <reads>        The mapped reads from the ribo-seq experiment.
 -prefix <prefix>      The prefix used for all output files
 -genomic <genomic>    The indexed GEDI genome.

Commandline:
 -progress             Show progress
 -D                    Verbose output of errors
 -h                    Show usage
 -hh                   Show verbose usage
 -hhh                  Show extra verbose usage
 -dry                  Dry run
 -keep                 Do not remove temp files

java.lang.RuntimeException: java.lang.RuntimeException: Could not run gedi.riboseq.javapipeline.PriceCollectSufficientStatistics@9d2be09
    at gedi.util.job.schedule.DefaultPetriNetScheduler.run(DefaultPetriNetScheduler.java:222)
    at gedi.util.program.GediProgram$1.execute(GediProgram.java:359)
    at gedi.util.program.GediProgram.run(GediProgram.java:204)
    at executables.Price.main(Price.java:68)
Caused by: java.lang.RuntimeException: Could not run gedi.riboseq.javapipeline.PriceCollectSufficientStatistics@9d2be09
    at gedi.util.program.GediProgramJob.execute(GediProgramJob.java:63)
    at gedi.util.program.GediProgramJob.execute(GediProgramJob.java:27)
    at gedi.util.job.FireTransition.call(FireTransition.java:54)
    at gedi.util.job.FireTransition.call(FireTransition.java:25)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalArgumentException: Invalid reference index -1
    at htsjdk.samtools.QueryInterval.<init>(QueryInterval.java:24)
    at htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter.query(SamReader.java:504)
    at gedi.region.bam.BamGenomicRegionStorage$BufferingGenomicRegionSpliterator.iterator(BamGenomicRegionStorage.java:992)
    at gedi.region.bam.BamGenomicRegionStorage$BufferingGenomicRegionSpliterator.tryAdvance(BamGenomicRegionStorage.java:1037)
    at gedi.util.functions.MappedSpliterator.tryAdvance(MappedSpliterator.java:54)
    at gedi.util.functions.MappedSpliterator.tryAdvance(MappedSpliterator.java:54)
    at java.util.Spliterators$1Adapter.hasNext(Spliterators.java:681)
    at gedi.util.functions.EI$4.hasNext(EI.java:141)
    at gedi.riboseq.cleavage.CleavageModelEstimator.collectEstimateData(CleavageModelEstimator.java:283)
    at gedi.riboseq.javapipeline.PriceCollectSufficientStatistics.execute(PriceCollectSufficientStatistics.java:59)
    at gedi.util.program.GediProgramJob.execute(GediProgramJob.java:61)
    ... 7 more

I've tested different data, it all seems happened on 18-.

I've run the three gedi Nashorn commands you suggested, the output: [X+,X-,Y+,Y-,10+,10-,11+,11-,12+,12-,13+,13-,14+,14-,15+,15-,16+,16-,17+,17-,18+,18-,MT+,MT-,19+,19-,1+,1-,2+,2-,3+,3-,4+,4-,5+,5-,6+,6-,7+,7-,8+,8-,9+,9-]

18-:88894181-88894212 [1] x1

18-:88894181-88894212 [1] x1

It is a bit cryptic, could you please help?

Best wishes, Fengyuan

florianerhard commented 6 years ago

Dear Fengyuan,

my guess is that this is also related to the mitochondrial genome (which you might want to remove anyways...). Could you check how this is named in the bam file (samtools view -H xyz.bam)?

Best, Florian

boboppie commented 6 years ago

Dear Florian,

Thanks for your quick reply. The mitochondrial is named chrM not MT. Does PRICE convert it to MT?

@HD VN:1.4  SO:coordinate
@SQ SN:chr1 LN:195471971
@SQ SN:chr2 LN:182113224
@SQ SN:chr3 LN:160039680
@SQ SN:chr4 LN:156508116
@SQ SN:chr5 LN:151834684
@SQ SN:chr6 LN:149736546
@SQ SN:chr7 LN:145441459
@SQ SN:chr8 LN:129401213
@SQ SN:chr9 LN:124595110
@SQ SN:chr10    LN:130694993
@SQ SN:chr11    LN:122082543
@SQ SN:chr12    LN:120129022
@SQ SN:chr13    LN:120421639
@SQ SN:chr14    LN:124902244
@SQ SN:chr15    LN:104043685
@SQ SN:chr16    LN:98207768
@SQ SN:chr17    LN:94987271
@SQ SN:chr18    LN:90702639
@SQ SN:chr19    LN:61431566
@SQ SN:chrX LN:171031299
@SQ SN:chrY LN:91744698
@SQ SN:chrM LN:16299

Best wishes, Fengyuan

boboppie commented 6 years ago

Dear Florian,

After removing all chrM reads and reheadered the bam (only removing reads is not working neither), I can successfully produce results. Many thanks for your help.

Do you think it is a PRICE issue to handle Mitochondrion incorrectly?

Best wishes, Fengyuan

florianerhard commented 6 years ago

Dear Fengyuan,

yeah, we try to handle cases where the genome comes from ucsc (e.g. with chr), but the annotation comes from ensembl (e.g. without chr). Unfortunately, there was a bug...

I fixed this in the most recent version (1.0.3)!

Florian

boboppie commented 6 years ago

Dear Florian,

I've just tested v1.0.3, it's working now. Thanks for the quick fix!

One minor point - when I run PRICE, it still shows 2018-09-06 13:32:59.233 INFO Gedi 1.0.2 (JAR) startup Are PRICE and GEDI having the same version number?

I also have a question about converting orf.cit to BED. It seems the BED output is a mix of BED6 (for non-spliced ORFs) and BED12 (for spliced ORFs). The number of ORFs between this two formats are different (BED has much fewer ORFs?) Is there a way to make all BED records to 12 columns?

Best wishes, Fengyuan

florianerhard commented 6 years ago

Dear Fengyuan,

GEDI is the framework that PRICE is built on, so yes (should include the PRICE version there as well...)!

Regarding BED6/BED12: This should not be the case anymore with 1.0.3!

Best, Florian

boboppie commented 6 years ago

Dear Florian,

You are right, 1.0.3 generates BED12 now, thanks for the improvement.

As I mentioned, the numbers of ORFs in orfs.tsv and orfs.bed are different, for example:

cmd: gedi -e ViewCIT -m bed -name 'd.getType()' orfs.cit >orfs.bed

orfs.tsv (47714):

ENSMUSG00000025915.14   ENSMUST00000166384.7_uORF_5 1+:9798143-9798246|9865130-9865186  1+:9798143-9798246|9865130-9865186  ACG uORF    0.78    0.47    0.031466    42.4    42.4
ENSMUSG00000025915.14   ENSMUST00000188625.1_ncRNA_4    1+:9827021-9827051  1+:9798231-9798246|9827012-9827051  CTG ncRNA   0.26    0.46    0.51739 56.0    56.0
ENSMUSG00000025915.14   ENSMUST00000168907.7_uORF_2 1+:9798245-9798246|9827012-9827047  1+:9798245-9798246|9827012-9827047  GTG uORF    0.11    0.30    0.071814    75.3    75.3
ENSMUSG00000025915.14   ENSMUST00000188625.1_ncRNA_7    1+:9827081-9827123  1+:9827069-9827123  CTG ncRNA   0.07    0.41    0.087234    32.1    32.1
ENSMUSG00000025915.14   ENSMUST00000168907.7_uORF_6 1+:9827081-9827119|9865130-9865149  1+:9827069-9827119|9865130-9865149  CTG uORF    0.07    0.41    0.29373 33.2    33.2
ENSMUSG00000025915.14   ENSMUST00000189050.1_ncRNA_9    1+:9865138-9865186  1+:9865138-9865186  ATG ncRNA   0.08    0.46    0.33833 17.6    17.6
ENSMUSG00000025915.14   ENSMUST00000189050.1_ncRNA_1    1+:9865252-9865348|9868382-9868466|9871629-9871702|9872257-9872337  1+:9865252-9865348|9868382-9868466|9871629-9871702|9872257-9872337  ATG ncRNA   0.57    0.53    1.0000  113.2   113.2
ENSMUSG00000025915.14   ENSMUST00000168907.7_CDS_0  1+:9865252-9865348|9868382-9868466|9871629-9871702|9872257-9872333|9877165-9877253|9879025-9879075|9880331-9880389|9881577-9881661|9881759-9881891|9884406-9884519|9884624-9884661|9885666-9885753|9886046-9886142|9888869-9889025|9891360-9891450|9898719-9898890  1+:9865252-9865348|9868382-9868466|9871629-9871702|9872257-9872333|9877165-9877253|9879025-9879075|9880331-9880389|9881577-9881661|9881759-9881891|9884406-9884519|9884624-9884661|9885666-9885753|9886046-9886142|9888869-9889025|9891360-9891450|9898719-9898890  ATG CDS 0.57    0.53    6.6474e-09  438.6   438.6
ENSMUSG00000025915.14   ENSMUST00000191338.1_ncRNA_3    1+:9884430-9884519|9884624-9884661|9885666-9885777  1+:9884430-9884519|9884624-9884661|9885666-9885777  AGG ncRNA   0.20    0.48    1.0000  72.6    72.6
ENSMUSG00000025915.14   ENSMUST00000191338.1_ncRNA_8    1+:9886064-9886181  1+:9886064-9886181  ATC ncRNA   0.11    0.26    1.0000  22.8    22.8
ENSMUSG00000056763.16   ENSMUST00000071087.11_Variant_0 1+:10038306-10038344|10045075-10045184|10047423-10047526|10058719-10058823|10064332-10064413|10065322-10065421|10066425-10066820|10074859-10074946|10077171-10077242|10082252-10082346|10083500-10083558|10085746-10085952|10088050-10088184|10088881-10088960|10090187-10090317|10095865-10096013|10104211-10104364|10108368-10108481|10112392-10112542|10112875-10113022|10113602-10113707|10116017-10116084|10116613-10116731|10126339-10126479|10127422-10127563|10129985-10130032|10132453-10132517|10133163-10133273|10134091-10134230|10135933-10136145  1+:10038306-10038344|10045075-10045184|10047423-10047526|10058719-10058823|10064332-10064413|10065322-10065421|10066425-10066820|10074859-10074946|10077171-10077242|10082252-10082346|10083500-10083558|10085746-10085952|10088050-10088184|10088881-10088960|10090187-10090317|10095865-10096013|10104211-10104364|10108368-10108481|10112392-10112542|10112875-10113022|10113602-10113707|10116017-10116084|10116613-10116731|10126339-10126479|10127422-10127563|10129985-10130032|10132453-10132517|10133163-10133273|10134091-10134230|10135933-10136145  TTG Variant 0.25    0.22    1.6777e-14  176.1   176.1
ENSMUSG00000056763.16   ENSMUST00000122156.7_CDS_1  1+:10045085-10045184|10047423-10047526|10058719-10058823|10059823-10059847|10064332-10064413|10065322-10065421|10066425-10066875    1+:10045085-10045184|10047423-10047526|10058719-10058823|10059823-10059847|10064332-10064413|10065322-10065421|10066425-10066875    ATG CDS 0.23    0.47    1.0000  68.1    68.1
ENSMUSG00000025925.14   ENSMUST00000188371.6_CDS_0  1+:15805677-15805957|15807640-15807736|15813040-15813162|15815885-15815972|15818933-15819083|15819650-15819763|15831506-15831566|15833400-15833489|15838181-15838288|15842920-15843082  1+:15805677-15805957|15807640-15807736|15813040-15813162|15815885-15815972|15818933-15819083|15819650-15819763|15831506-15831566|15833400-15833489|15838181-15838288|15842920-15843082  ATG CDS 0.66    0.54    0.0000  400.3   400.3
ENSMUSG00000025925.14   ENSMUST00000188371.6_iORF_1 1+:15805924-15805957|15807640-15807682  1+:15805924-15805957|15807640-15807682  AGG iORF    0.79    0.42    0.29004 11.8    11.8
ENSMUSG00000025940.6    ENSMUST00000176451.1_ncRNA_2    1+:16665313-16665505|16665707-16665731  1+:16665313-16665505|16665707-16665731  TTG ncRNA   0.16    0.36    0.93616 584.7   584.7
ENSMUSG00000025940.6    ENSMUST00000177501.1_Trunc_0    1+:16665313-16665505|16667664-16667773|16676967-16677425    1+:16665313-16665505|16667664-16667773|16676967-16677425    TTG Trunc   0.16    0.36    3.2460e-15  1119.2  1119.2
ENSMUSG00000025940.6    ENSMUST00000065373.5_Trunc_1    1+:16665313-16665505|16667667-16667773|16676967-16677425    1+:16665313-16665505|16667667-16667773|16676967-16677425    TTG Trunc   0.16    0.36    1.0000  1119.2  1119.2
ENSMUSG00000025940.6    ENSMUST00000065373.5_iORF_3 1+:16677136-16677163    1+:16677136-16677163    AGG iORF    0.08    0.36    0.44073 7.3 7.3
ENSMUSG00000025779.10   ENSMUST00000191260.1_ncRNA_2    1+:16688465-16688504    1+:16688465-16688504    ATG ncRNA   0.65    0.53    3.1100e-06  71.7    71.7

orfs.bed (12054):

1   9865252 9898890 CDS 0   +   0   0   0   16  96,84,73,76,88,50,58,84,132,113,37,87,96,156,90,171,    0,3130,6377,7005,11913,13773,15079,16325,16507,19154,19372,20414,20794,23617,26108,33467,
1   10038306    10136145    Variant 0   +   0   0   0   30  38,109,103,104,81,99,395,87,71,94,58,206,134,79,130,148,153,113,150,147,105,67,118,140,141,47,64,110,139,212,   0,6769,9117,20413,26026,27016,28119,36553,38865,43946,45194,47440,49744,50575,51881,57559,65905,70062,74086,74569,75296,77711,78307,88033,89116,91679,94147,94857,95785,97627,
1   15805677    15843082    CDS 0   +   0   0   0   10  280,96,122,87,150,113,60,89,107,162,    0,1963,7363,10208,13256,13973,25829,27723,32504,37243,
1   5083163 5084454 ncRNA   0   +   0   0   0   2   115,38, 0,1253,
1   5083498 5083618 uORF    0   +   0   0   0   1   120,    0,
1   5084456 5162165 Trunc   0   +   0   0   0   13  107,103,90,114,105,54,98,193,179,126,102,114,61,    0,4552,8906,11158,13572,16613,32933,39820,48627,51355,59293,65491,77648,
1   5143835 5149982 iORF    0   +   0   0   0   2   16,35,  0,6112,
1   6214901 6227997 uORF    0   +   0   0   0   2   56,58,  0,13038,
1   6230020 6274275 Trunc   0   +   0   0   0   22  53,127,171,203,427,171,185,187,82,62,104,127,1892,166,109,241,49,48,59,137,71,78,   0,3940,4208,8241,9931,14084,14774,15137,15404,15577,17685,18045,18255,30909,32813,33003,34521,35553,38677,40670,42680,44177,
1   7120434 7169882 Trunc   0   +   0   0   0   5   181,103,172,124,368,    0,27202,40457,42813,49080,

In orfs.tsv, there are 10 ORFs predicted in ENSMUSG00000025915.14, but only ENSMUST00000168907.7_CDS_0 appears in orfs.bed. Can you please let me know why there is a discrepancy?

Best wishes, Fengyuan

florianerhard commented 5 years ago

Dear Fengyuan,

sorry, I missed that question in the first place. The tsv contains all results (without filtering), the cit the FDR filtered list of orfs (see here).

Best, Florian

boboppie commented 5 years ago

Dear Florian,

That makes sense. Thank you so much for your help.

Best wishes, Fengyuan