galaxyproject / galaxy

Data intensive science for everyone.
https://galaxyproject.org

BED-to-BigBED File Conversion Tool has a number of issues: #5419

Open alexlenail opened 6 years ago

alexlenail commented 6 years ago
  1. The first time I tried to run this, I got the error "Couldn't open /galaxy-central/tool-data/shared/ucsc/chrom/hg19.len". As it turns out, the folder /galaxy-central/tool-data/shared/ucsc/chrom/ did not exist, so I fetched the file manually with wget. Should this file be distributed with Galaxy going forward?

  2. The BED files I am trying to convert to BigBEDs are output narrowPeaks from MACS2, which is to say they are BED6+4 formatted. This seems to cause the tool to crash. Since converting MACS2 narrowPeaks to BigBEDs is part of the canonical ENCODE pipeline for ATAC-Seq, this tool should probably be able to handle this use case. @jennaj recommended doing a sortBED in advance, but that did not solve the issue.

Offshoot of Galaxy 17.05, Docker container from @bgruening
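
For item 1, a quick sanity check of the chromosome length file can rule out a missing or malformed hg19.len before the converter runs. This is a minimal sketch, assuming the file is tab-separated with one "chromosome name, length" pair per line; the function name check_len_file is hypothetical and not part of Galaxy.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Galaxy): verify that a chromosome length
# file is readable and that every line is "<chrom><TAB><length>".
check_len_file() {
    local len_file="$1"
    if [ ! -r "$len_file" ]; then
        echo "missing or unreadable: $len_file" >&2
        return 1
    fi
    # Fail if the file is empty, or if any line does not have exactly two
    # tab-separated fields with an integer in the second field.
    awk -F'\t' '
        NF != 2 || $2 !~ /^[0-9]+$/ { bad = 1 }
        END { exit ((bad || NR == 0) ? 1 : 0) }
    ' "$len_file"
}
```

Usage would look like `check_len_file /galaxy-central/tool-data/shared/ucsc/chrom/hg19.len || echo "hg19.len needs to be fetched or regenerated"`. If the file has to be recreated, UCSC's fetchChromSizes script is one possible source of chromosome sizes.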

bgruening commented 6 years ago

@zfrenchee can you confirm that you are talking about https://github.com/galaxyproject/galaxy/blob/dev/tools/filters/bed_to_bigbed.xml

What was the error you got?

alexlenail commented 6 years ago

@bgruening Yes, that's the tool

The error message I get after I manually wget'ed the hg19.len file was:

column #10 isSizeLink do not match: Yours=[0]  BED Standard=[1]
asObjects differ.

bgruening commented 6 years ago

@zfrenchee your BED file seems to be at fault. The tool is simply using the UCSC conversion tool. Can you check your BED file? Also see here: https://groups.google.com/a/soe.ucsc.edu/forum/#!topic/genome/opwtvFfIslQ

alexlenail commented 6 years ago

@bgruening

The BED files I am trying to convert to BigBEDs are output narrowPeaks from MACS2, which is to say they are BED6+4 formatted. This seems to cause the tool to crash. Since converting MACS2 narrowPeaks to BigBEDs is part of the canonical ENCODE pipeline for ATAC-Seq, this tool should probably be able to handle this use case. jennaj recommended doing a sortBED in advance, but that did not solve the issue.

mblue9 commented 6 years ago

Hi guys, it looks like the tool may not be able to take MACS2 narrowPeak files, as it also needs an autoSql file. See this post (and below) from the UCSC Google group: https://groups.google.com/a/soe.ucsc.edu/forum/#!searchin/genome/narrowpeak$20bigbed/genome/wZsmrO9m0bg/gyO_KeRBAwAJ

narrowPeak files can be visualized directly on the UCSC Genome Browser as a custom track, no need to convert to bigBed first. However if you wanted to, example 3 on this page: http://genome.ucsc.edu/goldenPath/help/bigBed.html is a good example of how to convert a non-standard bed file to a bigBed file, in that you need to supply bedToBigBed with an autoSql file that describes your data. For more information see this question from our mailing list archive: https://groups.google.com/a/soe.ucsc.edu/d/msg/genome/9PXjH2mlqrE/MrBs3pZ9WLEJ
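
The conversion described in the UCSC reply could be sketched as follows. This assumes a recent UCSC bedToBigBed on the PATH (with its -type and -as options) and a local copy of an autoSql file describing the ten narrowPeak columns, such as narrowPeak.as from the kent source tree. The function names and all file names are placeholders, not anything Galaxy provides.

```shell
#!/usr/bin/env bash
# Sketch only: convert a BED6+4 narrowPeak file to bigBed using an autoSql file.
# Assumes bedToBigBed is installed; function and file names are hypothetical.

# bedToBigBed expects input sorted by chromosome, then by start position.
sort_bed() {
    LC_ALL=C sort -k1,1 -k2,2n "$1"
}

narrowpeak_to_bigbed() {
    local peaks="$1" chrom_sizes="$2" as_file="$3" out="$4"
    local sorted
    sorted="$(mktemp)"
    sort_bed "$peaks" > "$sorted"
    # -type=bed6+4: six standard BED fields plus four extra fields.
    # -as=FILE:     autoSql file describing all ten columns.
    bedToBigBed -type=bed6+4 -as="$as_file" "$sorted" "$chrom_sizes" "$out"
    local status=$?
    rm -f "$sorted"
    return "$status"
}
```

A call might look like `narrowpeak_to_bigbed peaks.narrowPeak hg19.chrom.sizes narrowPeak.as peaks.bb`. Whether the result displays as intended in the Genome Browser would still need to be checked.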

mblue9 commented 6 years ago

Ah, the Galaxy BED-to-bigBed converter tool says that non-standard BED files (which require autoSql files) are not currently supported:

Currently, the bedFields option to specify the number of non-standard fields is not supported as an AutoSQL file must be provided, which is a format currently not supported by Galaxy.

bgruening commented 6 years ago

As a reference: http://genome-source.cse.ucsc.edu/gitweb/?p=kent.git;a=blob_plain;f=src/hg/lib/encode/narrowPeak.as;hb=HEAD

alexlenail commented 6 years ago

Many thanks for looking into this @mblue9. The strange part is that we use a command-line tool which executes this without a problem; I was assuming the tools would be the same.

bedToBigBed a_sorted_bed_file.bed -bedFields=6 /path/to/chrom.sizes out_file_name.bed

(thanks @bwassie)

alexlenail commented 6 years ago

@bgruening @mblue9 if adding this functionality is impossible or impractical in the scope of the Galaxy BED-to-BigBED tool, please feel free to close this issue.

mblue9 commented 6 years ago

@zfrenchee are you using the ENCODE bedToBigBed tool? https://www.encodeproject.org/software/bedToBigBed/

as that says it can handle non-standard bed but also says it needs a .as file:

bedToBigBed takes a standard bed file or a non-standard bed file with associated .as file to create a compressed bigBed version

Just wondering if you had to cut out 6 columns from the macs2 file to get it to work without the .as file? How many columns does your sorted bed file (a_sorted_bed_file.bed) have?
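
Both questions can be checked from the command line. The sketch below assumes a tab-separated BED file with no header or track lines; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for inspecting a tab-separated BED/narrowPeak file.

# Print each distinct column count found in the file (one per line).
bed_column_counts() {
    awk -F'\t' '{ print NF }' "$1" | sort -n | uniq
}

# Keep only the six standard BED columns, dropping the narrowPeak extras.
bed_first_six() {
    cut -f1-6 "$1"
}
```

For example, `bed_column_counts a_sorted_bed_file.bed` shows whether every line has the same number of fields, and `bed_first_six a_sorted_bed_file.bed > six_col.bed` produces a plain BED6 file that does not need a .as file.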

bwassie commented 6 years ago

Hi @mblue9, Yeah we're using the UCSC bedToBigBed tool. I just checked and it runs and produces an output when we use the standard macs2 narrowPeak file with 10 columns. I don't know if the bigBed file will visualize correctly on UCSC but it does run without an error.

mblue9 commented 6 years ago

Thanks for the info @bwassie. My guess is that if you don't use the .as file, the output may not display correctly in UCSC. Also, did you see that there is a new format, bigNarrowPeak (announced in December by UCSC)? It also requires the bedToBigBed tool to be run with a .as file; see below.

Does anyone know if the Galaxy bedToBigBed wrapper could be changed to accept an autoSql (.as) file?

Below from http://genome.ucsc.edu/goldenPath/help/bigNarrowPeak.html:

bigNarrowPeak Track Format

The bigNarrowPeak format stores annotation items that are a single block with a single base peak within that block, much as BED files indexed as bigBeds do. A bigNarrowPeak file is a standard six field bed with four additional fields that contain three doubles with scoring information and the location of the single base peak. It is the binary version of the ENCODE narrowPeak or point-source peak format.

The bigNarrowPeak files are created using the program bedToBigBed, run with the -as option to pull in a special autoSql (.as) file that defines the extra fields of the bigNarrowPeak.

The bigNarrowPeak files are in an indexed binary format. The main advantage of this format is that only those portions of the file needed to display a particular region are transferred to the Genome Browser server. Because of this, indexed binary files have considerably faster display performance than regular BED format files when working with large data sets. The bigNarrowPeak file remains on your local web-accessible server (http, https or ftp), not on the UCSC server, and only the portion needed for the currently displayed chromosomal position is locally cached as a "sparse file". If you do not have access to a web-accessible server and need hosting space for your bigNarrowPeak files, please see the Hosting section of the Track Hub Help documentation.

bigNarrowPeak file definition

The following autoSql definition is used to specify bigNarrowPeak files. This definition, contained in the file bigNarrowPeak.as, is pulled in when the bedToBigBed utility is run with the -as=bigNarrowPeak.as option.