UcarLab / CoRE-ATAC

java.lang.StringIndexOutOfBoundsException when running feature extraction #5

Open dylan6thomas opened 2 years ago

dylan6thomas commented 2 years ago

Hi, I am trying to run feature extraction on bulk-ATAC-seq data, but I am met with this error:

Exception in thread "main" java.lang.reflect.InvocationTargetException
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:564)
        at org.eclipse.jdt.internal.jarinjarloader.JarRsrcLoader.main(JarRsrcLoader.java:61)
Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -23772
        at java.base/java.lang.String.substring(String.java:1840)
        at org.jax.FASTAReader.ForwardFASTAReader.setRegion(ForwardFASTAReader.java:61)
        at org.jax.bamextractor.ReadExtractor.extractFeatures(ReadExtractor.java:245)
        at org.jax.bamextractor.ReadExtractor.main(ReadExtractor.java:89)
        ... 5 more

The extraction continues to run, but I am unable to use tl.readTensors() on the resulting data. However, this may have been an error in my code that I have since fixed.

Is this error a problem?

ajt986 commented 2 years ago

Hello, to me this looks like the input peak file might have an error at this step. There should be peak files generated in the output directory that you can check. Essentially, there is one peak file that represents the original input peaks and another that uses those peaks to make 600 bp windows. Problems at this step would also cause problems when reading with the readTensors() function.

dylan6thomas commented 2 years ago

Would using an unedited bigBed file cause this? I edited the file to only include the peaks and am running it again currently.

dylan6thomas commented 2 years ago

Do you know what kind of error to look for? It seems like a normal BED file to me.

dylan6thomas commented 2 years ago

> Would using an unedited bigBed file cause this? I edited the file to only include the peaks and am running it again currently.

I actually didn't use a bigBed file; I used a BED narrowPeak file.

ajt986 commented 2 years ago

The feature extraction method should make two peak files ending with "_peaks.txt" and "_original.txt". It uses the _peaks.txt file since those peaks will all be 600 bp in length. It should look something like this:

chr1    29062   29661   1   29305   29420
chr1    237452  238051  2   237696  237809
chr1    521268  521867  3   521531  521605
chr1    662365  662964  4   662623  662708

Each line has the peak positions reformatted to fit a 600 bp window, then the peak index, followed by the original peak positions.
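
For illustration, a minimal sketch of how a line in that format could be parsed; the PeakWindow class and field names below are assumptions based on the example, not actual CoRE-ATAC classes.

// Hypothetical parser for one _peaks.txt line: window coordinates, peak
// index, then the original peak coordinates. Not part of CoRE-ATAC itself.
public class PeakWindow {
    String chrom;
    int windowStart, windowEnd;     // reformatted 600 bp window
    int index;                      // peak index
    int originalStart, originalEnd; // original peak coordinates

    static PeakWindow parse(String line) {
        String[] f = line.trim().split("\\s+");
        PeakWindow p = new PeakWindow();
        p.chrom = f[0];
        p.windowStart = Integer.parseInt(f[1]);
        p.windowEnd = Integer.parseInt(f[2]);
        p.index = Integer.parseInt(f[3]);
        p.originalStart = Integer.parseInt(f[4]);
        p.originalEnd = Integer.parseInt(f[5]);
        return p;
    }
}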

When I look at the code for the error:

ForwardFASTAReader.java:61

        if (start > _start){ 
            _fasta = _fasta.substring(start-_start-1); // line 61
            _start = start-1;
        }

This suggests to me that these start positions used to get the reference sequence are negative. This could mean that there is a negative start position in the file or somehow the start position was so large that the integer overflowed to a negative number. This is my best guess at the moment of what's happening anyway. What reference genome are you using?
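
For illustration, a minimal self-contained sketch of those two hypotheses (a negative index reaching substring(), or an int overflow producing one); the string and values are made up and this is not necessarily what is happening in the reporter's run.

// Illustrative only: two ways a substring index can end up negative.
public class SubstringFailureDemo {
    public static void main(String[] args) {
        String fasta = "ACGTACGTAC"; // stand-in for a loaded chromosome sequence

        // Hypothesis 1: a negative start position reaches substring(),
        // which throws StringIndexOutOfBoundsException as in the stack trace.
        try {
            fasta.substring(-5);
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("negative index: " + e.getMessage());
        }

        // Hypothesis 2: a start position large enough to overflow int
        // wraps around to a large negative value.
        int hugeStart = Integer.MAX_VALUE;
        int overflowed = hugeStart + 1000; // wraps to a negative number
        System.out.println("overflowed start = " + overflowed);
    }
}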

dylan6thomas commented 2 years ago

I am using Hg38

dylan6thomas commented 2 years ago

There are no negative values and the largest integer is around 200,000,000, so I am not sure

ajt986 commented 2 years ago

Does your _peaks.txt file look similar to the example I posted above? The difference between the start and end positions should be 600 for each one.

dylan6thomas commented 2 years ago

This is a sample of the _peaks.txt file:

chr1    100038019   100038618   0   100037635   100039004
chr1    100038019   100038618   1   100037635   100039004
chr1    100038019   100038618   2   100037635   100039004
chr1    100038019   100038618   3   100037635   100039004
chr1    100038019   100038618   4   100037635   100039004
chr1    100075626   100076225   5   100075800   100076052
chr1    100132683   100133282   6   100132510   100133456
chr1    100132683   100133282   7   100132510   100133456
chr1    100132683   100133282   8   100132510   100133456
chr1    1001673 1002272 9   1001831 1002115

I'm not sure why the first 5 rows are identical.


ajt986 commented 2 years ago

It's possible that the duplicates are causing the issue. It looks like the last three are identical too. Does your input file also contain these duplicates? I'd try removing the duplicates and running the feature extraction again.
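
For illustration, a minimal sketch of one way to drop exact duplicate lines from a peak file before rerunning; the file names are placeholders, not paths from this thread.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashSet;

// Illustrative only: remove exact duplicate lines from a BED/narrowPeak
// file while keeping the original line order. File names are placeholders.
public class DedupPeaks {
    public static void main(String[] args) throws IOException {
        LinkedHashSet<String> unique =
                new LinkedHashSet<>(Files.readAllLines(Path.of("peaks.narrowPeak")));
        Files.write(Path.of("peaks.dedup.narrowPeak"), unique);
    }
}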

dylan6thomas commented 2 years ago

There were a significant number of duplicates, but the same issue arose even after removing them.

dylan6thomas commented 2 years ago

chr1    100038019   100038618   0   100037635   100039004
chr1    100075626   100076225   1   100075800   100076052
chr1    100132683   100133282   2   100132510   100133456
chr1    1001673 1002272 3   1001831 1002115
chr1    100188582   100189181   4   100188751   100189014
chr1    100210662   100211261   5   100210857   100211068
chr1    100210934   100211533   6   100211134   100211335
chr1    100213069   100213668   7   100213216   100213522
chr1    100217697   100218296   8   100217852   100218143

dylan6thomas commented 2 years ago

I am not sure how it would be possible for the peak start to be below 0; the peak start positions range from around 10,000 to 200,000,000.

Is there any type of normalization on the peaks?

ajt986 commented 2 years ago

The only functions applied are 1) identifying the center position and 2) extending by ~300bp upstream/downstream. That seems to be working fine though.
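
A minimal sketch of that windowing, using the first example peak from the _peaks.txt excerpt earlier in the thread; the exact centering and off-by-one conventions here are inferred, though they do reproduce that example row.

// Illustrative only: center the original peak and extend to a 600 bp window.
public class WindowDemo {
    public static void main(String[] args) {
        // Original peak coordinates from the example row "chr1 29062 29661 1 29305 29420".
        int peakStart = 29305, peakEnd = 29420;
        int center = (peakStart + peakEnd) / 2;   // 29362
        int windowStart = center - 300;           // 29062
        int windowEnd = windowStart + 599;        // 29661 -> a 600 bp window
        System.out.println("chr1\t" + windowStart + "\t" + windowEnd
                + "\t1\t" + peakStart + "\t" + peakEnd);
    }
}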

The feature extractor sorts these peaks by chromosome and then start position to incorporate the read and reference sequence information. The reference FASTA files should be separated for each chromosome (i.e., chr1.fa, chr2.fa etc). I used the chromosome FASTA files available from UCSC.
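
For illustration, a sketch of that ordering (chromosome, then start position); the Peak record (Java 16+) is a placeholder, not a CoRE-ATAC class.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative only: sort peaks by chromosome, then start, mirroring the
// ordering described above.
public class SortPeaksDemo {
    record Peak(String chrom, int start, int end) {}

    public static void main(String[] args) {
        List<Peak> peaks = new ArrayList<>(List.of(
                new Peak("chr2", 521268, 521867),
                new Peak("chr1", 662365, 662964),
                new Peak("chr1", 29062, 29661)));
        peaks.sort(Comparator.comparing(Peak::chrom)
                             .thenComparingInt(Peak::start));
        peaks.forEach(p -> System.out.println(p.chrom() + "\t" + p.start() + "\t" + p.end()));
    }
}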

This is a strange error, and after reviewing the code several times, I still don't know why it would be doing this. Does it still fail if you take just a subset of the peaks? All on one chromosome, only peaks with positions below 100,000 bp, etc.? Hopefully this will narrow down what the problem is.

dylan6thomas commented 2 years ago

I tried running it on a smaller subset (just the first few thousand in the dataset) and it worked fine...

dylan6thomas commented 2 years ago

Were there any preprocessing steps your team needed to take when using data from the ENCODE database? That is where I sourced all of my data from.

dylan6thomas commented 2 years ago

Which files from ENCODE did you use exactly?

ajt986 commented 2 years ago

I used the ATAC-seq BAM files available from ENCODE. The full list of the ENCODE data used can be found in the publication: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009670

For processing, reads were aligned using BWA mem, shifted, and duplicates removed. Peaks were then called using MACS2 with the BAMPE option.

dylan6thomas commented 2 years ago

So would using the bed narrowPeak files available with the dataset cause problems?

dylan6thomas commented 2 years ago

Also, I am not sure what you mean by shifted; is that an option in BWA mem?

ajt986 commented 2 years ago

The narrowPeak files should not cause any problems. I wish the bug were clearer, but it's difficult to diagnose.

The shifting is essentially just accounting for the 9 bp overhang from transposition of the fragment. The assumption is that the Tn5 transposase sits at the center of the cut site, so the forward fragment cut sites are shifted by 5 bp and the reverse fragment cut sites by -4 bp.
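
For illustration, a minimal sketch of that adjustment as described above; the class and method names are placeholders, not CoRE-ATAC's API.

// Illustrative only: shift a cut site by +5 bp on the forward strand and
// -4 bp on the reverse strand, per the description above.
public class Tn5ShiftDemo {
    static int shiftCutSite(int position, boolean forwardStrand) {
        return forwardStrand ? position + 5 : position - 4;
    }

    public static void main(String[] args) {
        System.out.println(shiftCutSite(1_000_000, true));  // 1000005
        System.out.println(shiftCutSite(1_000_000, false)); // 999996
    }
}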