biocore-ntnu / epic2

Ultraperformant reimplementation of SICER
https://doi.org/10.1093/bioinformatics/btz232
MIT License

"Exception: Not valid subsetter: 1" while using epic2-df #32

Closed wesleylcai closed 5 years ago

wesleylcai commented 5 years ago

I'm trying to analyze knockout and wildtype samples (including input for each) using epic2-df. However, I get the following error: "Exception: Not valid subsetter: 1"

Here's the full output: epic2-output.txt

Here are examples (head -n100 of the input files):

TKO: Sample_2D_KDM2A_me3.mqsd.head100.bedpe.txt
CKO: Sample_2D_KDM2A_input.mqsd.head100.bedpe.txt
TWT: Sample_2D_Arab2_me3.mqsd.head100.bedpe.txt
CWT: Sample_2D_Arab2_input.mqsd.head100.bedpe.txt

Here's my command:

epic2-df \
  --treatment-knockout Sample_2D_KDM2A_me3.mqsd.bedpe \
  --control-knockout Sample_2D_KDM2A_input.mqsd.bedpe \
  --treatment-wildtype Sample_2D_Arab2_me3.mqsd.bedpe \
  --control-wildtype Sample_2D_Arab2_input.mqsd.bedpe \
  --genome hg19 \
  --false-discovery-rate-cutoff 0.01 \
  --false-discovery-rate-comparison 0.01 \
  --bin-size 200 \
  --gaps-allowed 3 \
  --fragment-size 200 \
  --chromsizes hg19.chrom.sizes \
  --output-knockout Sample_2D_KDM2A_me3.mqsd \
  --output-wildtype Sample_2D_Arab2_me3.mqsd

Interestingly, some of the commands worked (with another set of bedpe files), so it may be an incompatibility between some of my bedpe files? Any assistance would be appreciated!

endrebak commented 5 years ago

Is this reproducible with just the head? Will look at it on Monday :) Thanks for bothering to report :)

wesleylcai commented 5 years ago

I tried it with the head and also again with head -n100000

Looks like it works for those files... Hmm, so maybe there are some wonky lines in the files? How do you think we can pinpoint the problem?

endrebak commented 5 years ago

The error seems to be in my pyranges library. The error message says that the chromosome is an int, but it should always be a string. Dunno why it happens, but I am trying to fix it :)

Can you check your version of pyranges with

$ python
>>> import pyranges as pr
>>> pr.__version__

wesleylcai commented 5 years ago

Ahaaaa. I think I might know why... I used bowtie2 to map my fastq and then converted them to bedpe using bedtools. The scaffold names are "1, 2, 3...X, Y, MT", instead of "chr1, chr2, chr3...chrX, chrY, chrM". Indeed I had to use a custom chrom.sizes file that lists the scaffolds as 1,2,3.

Do you think this could be the cause?

endrebak commented 5 years ago

The error is in epic2-df after it has successfully run epic on both KO and WT. So the error happens when it works on the result of those epic2 runs.

endrebak commented 5 years ago

> Do you think this could be the cause?

No, but I wondered why you used a custom genome sizes file for hg19. When I realized why you did it I added a warning message to epic2 when the chromosome size names and chromosome names in the read file are incompatible.
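For anyone wanting to run a similar check by hand before starting epic2, here is a minimal sketch. It is not the code behind the epic2 warning; the function names and the toy data are hypothetical, and it assumes the chromosome name is the first whitespace-separated column in both the chrom.sizes file and the read file (as it is for bedtools BEDPE output).

```python
import io


def chromosomes_in_sizes(handle):
    """Collect chromosome names from a two-column chrom.sizes file."""
    return {line.split()[0] for line in handle if line.strip()}


def chromosomes_in_reads(handle):
    """Collect first-column chromosome names from a BED/BEDPE file."""
    return {
        line.split()[0]
        for line in handle
        if line.strip() and not line.startswith("#")
    }


def unknown_chromosomes(reads_handle, sizes_handle):
    """Return read chromosomes that are absent from the chrom.sizes file."""
    reads = chromosomes_in_reads(reads_handle)
    sizes = chromosomes_in_sizes(sizes_handle)
    return sorted(reads - sizes)


# Toy data: UCSC-style names in chrom.sizes, Ensembl-style names in the reads.
# The sizes are made-up placeholders, not real hg19 lengths.
sizes_file = io.StringIO("chr1\t1000000\nchrX\t500000\n")
reads_file = io.StringIO(
    "1\t100\t150\t1\t300\t350\tread1\t42\t+\t-\n"
    "X\t200\t250\tX\t400\t450\tread2\t42\t+\t-\n"
)

missing = unknown_chromosomes(reads_file, sizes_file)
if missing:
    print("Chromosomes in reads but not in chrom.sizes:", missing)
```

With real files, the two `io.StringIO` objects would be replaced by `open(...)` handles. An empty result means every chromosome in the read file has an entry in the sizes file.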

endrebak commented 5 years ago

That is okay, I am hoping the error is due to your pyranges being old :)

wesleylcai commented 5 years ago

Looks like it's version 0.0.53

$ python
Python 3.6.7 | packaged by conda-forge | (default, Jul 2 2019, 02:18:42)
[GCC 7.3.0] on linux
>>> import pyranges as pr
>>> pr.__version__
'0.0.53'

> The error is in epic2-df after it has successfully run epic on both KO and WT. So the error happens when it works on the result of those epic2 runs.

Indeed, the individual outputs work well and I get two files in the output folder. So I agree with your assessment.

endrebak commented 5 years ago

That is the latest version. Do you have the opportunity to send the zipped dataset to me via dropbox or google drive? I will treat it as confidential. Then debugging would be easy :)

wesleylcai commented 5 years ago

Yes, I can send you a google drive link. Which email should I use?

endrebak commented 5 years ago

endrebak85 # gmail.com. Thanks!

wesleylcai commented 5 years ago

I have sent you an invite via google drive! Thanks for your help.

endrebak commented 5 years ago

I have downloaded the files and am running the analysis now. I have some potential fixes that I will attempt tomorrow :)

endrebak commented 5 years ago

I was able to reproduce the error. Hooray! Will continue tomorrow. Thanks for sharing a reproducible example :)

endrebak commented 5 years ago

(Did not mean to close)

endrebak commented 5 years ago

(Notes to self)

The error seems to be due to the following:

When pandas reads a table, it guesses the types of the columns. For our files it guesses that the chromosome column is of type int, since the first rows contain 1, 2, ..., but when it gets to X and Y it changes its mind and treats the values as object/str.

sys:1: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.

So you end up with the following different chromosomes:

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', 'X', 'Y']

So initially, it uses an int for lookup.
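A small illustration of the problem and one possible workaround, using toy data. This is only a sketch and not necessarily what the epic2 fix does; the column names and the table contents are made up. It relies on the `dtype` argument of `pandas.read_csv` and on `Series.astype(str)`.

```python
import io

import pandas as pd

# Toy table whose chromosome names lack the "chr" prefix (hypothetical data).
TSV = "Chromosome\tStart\tEnd\n1\t100\t300\n13\t500\t700\nX\t900\t1100\n"

# Telling pandas up front that the column holds strings avoids the type guess,
# so "1" and "13" are never parsed as integers.
df = pd.read_csv(io.StringIO(TSV), sep="\t", dtype={"Chromosome": str})
chromosomes = df["Chromosome"].tolist()

# If a mixed int/str column already exists, 13 and "13" count as different
# keys. Converting everything to str merges them again.
mixed = pd.Series([1, 2, 13, "13", "14", "X", "Y"], dtype=object)
normalised = mixed.astype(str)

n_mixed = mixed.nunique()            # 13 and "13" are counted separately
n_normalised = normalised.nunique()  # both are now the string "13"
print(chromosomes, n_mixed, n_normalised)
```

The same idea applies to any table that is later grouped or looked up by chromosome: keeping the key column as str throughout means a lookup with "1" never meets an int 1.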

I have fixed this in epic2 now. I will also need to find a fix that works for PyRanges in general.

Try pip install epic2==0.0.41. The fix will take a few hours to be out on bioconda.

Feel free to reopen if this did not fix it for you :)