biocore-ntnu / epic

(DEPRECATED) epic: diffuse domain ChIP-Seq caller based on SICER
http://bioepic.readthedocs.io
MIT License

epic - is it usual that analyses take time? #46

Closed anishdattani closed 7 years ago

anishdattani commented 7 years ago

Hi,

Just to ask, I have run epic and it seems to be stuck at this step:

```
Making duplicated bins unique by summing them. (File: count_reads_in_windows, Log level: INFO, Time: Wed, 09 Nov 2016 22:19:45)
Merging ChIP and Input data. (File: helper_functions, Log level: INFO, Time: Wed, 09 Nov 2016 22:22:00)
0.8 effective_genome_size (File: compute_background_probabilites, Log level: DEBUG, Time: Wed, 09 Nov 2016 22:24:23)
200 window size (File: compute_background_probabilites, Log level: DEBUG, Time: Wed, 09 Nov 2016 22:24:23)
61753836 total chip count (File: compute_background_probabilites, Log level: DEBUG, Time: Wed, 09 Nov 2016 22:24:23)
15438459000.0 average_window_readcount (File: compute_background_probabilites, Log level: DEBUG, Time: Wed, 09 Nov 2016 22:24:23)
```

It has been >24hrs now, is this usual?

My input command was as follows:

```
epic --treatment --treatment /pathtofolder/lib.filtered.H3K27me3_sorted.MARKDUP.bedpe --input /pathtofolder/lib.filtered.input.sorted_2_MARKDUP.bedpe chromsizes /home/anish/marked_duplicates_ChIP-seq_YM/smed_only/SmedAsxl.size.file --bigwig BIGIWG -pe --fragment-size 300 -cpu 30 --effective_genome_size 0.8
```

Many thanks, Anish

endrebak commented 7 years ago

No, epic should finish rather quickly. I use massive files and it takes 5 minutes, tops (optionally writing additional data such as matrices to disk might take a little extra time).

Something about your non-standard files is causing this. Unfortunately, I can't debug it without the files. If you have the opportunity to share them, I can take a closer look.

Is there a reason you are doing ChIP-Seq calling on contigs?


anishdattani commented 7 years ago

Hi,

We are using a genome that is understudied, hence no chromosome resolution but lots of contigs. We were just exploring whether epic could deal with this data.

I don't mind sharing the bedpe files - what would be the best way?

Best wishes, Anish

endrebak commented 7 years ago

Dropbox/google drive link? Thanks for this. Please include the chromsizes file.

As far as I can see, the statistical method in SICER (and hence epic) should be able to deal with this situation. I have never used contigs myself, but I guess you can have many overlapping regions? Anyways, your FDRs will be more conservative, but otherwise epic should be able to find the most enriched regions and order them correctly.

You might want to run the script epic-effective on the fasta to compute the effective genome size.

Ps. the --fragment-size is not used for paired end data. I should update the help message.

Psps. it seems you forgot the two dashes in the --chromsizes flag :)
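Putting those two fixes together, the corrected call would look roughly like this (a sketch using the paths from the original command; it drops the duplicated `--treatment`, restores the dashes on `--chromsizes`, and omits `--fragment-size` since it is ignored for paired-end data):

```shell
epic --treatment /pathtofolder/lib.filtered.H3K27me3_sorted.MARKDUP.bedpe \
     --input /pathtofolder/lib.filtered.input.sorted_2_MARKDUP.bedpe \
     --chromsizes /home/anish/marked_duplicates_ChIP-seq_YM/smed_only/SmedAsxl.size.file \
     --bigwig BIGIWG -pe -cpu 30 --effective_genome_size 0.8
```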

anishdattani commented 7 years ago

Hi Endre,

This is the link to the BedPe files - you can find input and sample files here.

https://user:Cifperel5@zoo-hydra.zoo.ox.ac.uk/

I will try again, accounting for the chromsizes and fragment-size mistakes.

Many thanks and best wishes, Anish


endrebak commented 7 years ago

Thanks.

1) Would you mind me taking a small subset of these data and including them in an automated test?

2) Do you have a file of the chromsizes for these bedpes? If not I can make one myself where the length of the chromosome is equal to the end of the rightmost read (this won't be much work).
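For reference, making such a chromsizes file from a BEDPE takes only a few lines of Python (a sketch, not part of epic; it takes each contig's size to be the largest end coordinate seen on either mate):

```python
from collections import defaultdict

def chromsizes_from_bedpe(lines):
    """Return {contig: size}, where size is the largest end coordinate seen."""
    sizes = defaultdict(int)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip malformed lines
        chrom1, _, end1, chrom2, _, end2 = fields[:6]
        sizes[chrom1] = max(sizes[chrom1], int(end1))
        sizes[chrom2] = max(sizes[chrom2], int(end2))
    return dict(sizes)

if __name__ == "__main__":
    bedpe = [
        "contig_1\t100\t250\tcontig_1\t400\t550\t.\t.\t+\t-",
        "contig_2\t10\t160\tcontig_2\t300\t450\t.\t.\t+\t-",
    ]
    for chrom, size in sorted(chromsizes_from_bedpe(bedpe).items()):
        print(f"{chrom}\t{size}")  # contig_1 -> 550, contig_2 -> 450
```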

Ps. this is first on my todo list after merging a PR I just got, but I cannot guarantee that I will solve it quickly.


endrebak commented 7 years ago

I tried this, and it worked on a small subset, so I guess it should work on the large files. Am running it on the full files right now.

Ps. epic runs stuff in parallel per chromosome, so you should use as many cores as possible.

anishdattani commented 7 years ago

Thanks for your time and help Endre.

One more question - when computing the effective genome size with epic-effective I get the following message:

```
Traceback (most recent call last):
  File "./epic-effective", line 35, in
    effective_genome_size(fasta, read_length, nb_cpu, tmpdir)
  File "/home/anish/.local/lib/python2.7/site-packages/epic/scripts/effective_genome_size.py", line 22, in effective_genome_size
    logging.info("Temporary directory: " + tmpdir)
TypeError: cannot concatenate 'str' and 'NoneType' objects
```

The input command was:

```
./epic-effective --read-length=300 --nb-cpu=32 /pathtofolder/genome
```

Best Anish


endrebak commented 7 years ago

It worked for me too! Happy, happy. I love it when you are able to use a tool in an unexpected or unintended way.

The error happened because you had not set a temporary directory, and I had forgotten to set a default. Fixed in 0.1.23, which is out on PyPI now.
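In sketch form, the bug and the fix look something like this (illustrative Python, not epic's actual code; `tempfile.gettempdir()` stands in for whatever default 0.1.23 actually uses):

```python
import logging
import tempfile

def log_tmpdir(tmpdir=None):
    # With tmpdir left as None, "Temporary directory: " + tmpdir raised the
    # TypeError above; falling back to a default directory avoids it.
    if tmpdir is None:
        tmpdir = tempfile.gettempdir()
    logging.info("Temporary directory: " + tmpdir)
    return tmpdir
```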

endrebak commented 7 years ago

Also, if you are able to use this in a paper eventually, I'd be very happy to read it. I will almost certainly even cite it in my eventual paper to show what is possible with epic.

anishdattani commented 7 years ago

Yes, of course - if I use this in an eventual publication I will of course send it to you beforehand.

Is there a way of changing the file path of Jellyfish used to calculate effective genome size - I couldn’t seem to work it out. This is what I get:

```
/bin/sh: 1: jellyfish: not found
/bin/sh: 1: jellyfish: not found
Traceback (most recent call last):
  File "./epic-effective", line 35, in
    effective_genome_size(fasta, read_length, nb_cpu, tmpdir)
  File "/home/anish/.local/lib/python2.7/site-packages/epic/scripts/effective_genome_size.py", line 55, in effective_genome_size
    shell=True)
  File "/usr/lib/python2.7/subprocess.py", line 544, in check_output
    raise CalledProcessError(retcode, cmd, output=output)
subprocess.CalledProcessError: Command 'jellyfish stats /home/anish/tmp/SmedAsxl_genome_v1.1.nt.jf' returned non-zero exit status 127
rm: cannot remove `/home/anish/tmp/SmedAsxl_genome_v1.1.nt.jf': No such file or directory
```

Best, A


endrebak commented 7 years ago

It seems like you do not have jellyfish on your path. If you use the bash shell, I think this should work:

```
PATH=/full_path/to/jellyfish_directory:$PATH epic-effective ...
```

I do not think you need to use epic-effective, though. For most mammalian genomes the EGS is around 0.9 with a read length of 100; with a read length of 300 it is going to be close to 1, so you can probably just use that. The only difference will be in the n-th decimals of your FDR score.
