bedops / bedops

:microscope: BEDOPS: high-performance genomic feature operations
https://bedops.readthedocs.io/
Other
295 stars 59 forks source link

Problem that needs solution #226

Closed demis001 closed 5 years ago

demis001 commented 5 years ago

I have five bed files: A.bed, B.bed, C.bed, D.bed, E.bed.

1) I want to extract all locus in A.bed that have an overlaps with B.bed and E.bed BUT NOT with C.bed OR

2) Extract all locus in A.bed that have an overlap with C.bed and D.bed BUT NOT with B.bed

Any help in the approach will be appreciated!

I have attempted with Union and merge files and extract using awk, but it didn't work well.

Best, @demis001

sjneph commented 5 years ago

Our forum is not often used, but a good place to ask these kinds of things: https://bedops.altius.org/forum/ - we do monitor it.

The answers depend upon your exact meaning of some of the words. All files need to be sorted with sort-bed. # all bases over all input files are covered in the output (a union of bases) bedops -m B.bed E.bed

# bases in common with B.bed E.bed (an intersection of bases) bedops -i B.bed E.bed

# full rows in A.bed that overlap either B.bed or E.bed by 1 bp bedops -e 1 A.bed B.bed E.bed or bedops -m B.bed E.bed | bedops -e 1 A.bed -

The default for bedops -e is 100% overlap required. I used 1 (bp) here, and you can use either bp or percentages. bedops -e 5 or bedops -e 25% The -e operation is "element-of" and it gets complete rows from A.bed. You could instead use -i (intersection) if you are interested in exact regions of overlaps.

# grab full elements of A.bed that overlap the intersection of B.bed and E.bed by 10 bp or more # and do not overlap C.bed or D.bed at all.

bedops -i B.bed E.bed \
  | bedops -e 10 A.bed - \
  | bedops -n 1 - C.bed D.bed

# grab bases in A.bed that intersect both B.bed and E.bed but not the others

bedops -i A.bed B.bed E.bed \
  | bedops -c - C.bed D.bed

The longer names (--intersection, --element-of, --not-element-of, --complement) might be easier to think about when starting out with bedops, and detailed examples/information are available with bedops --help-complement, for example.

Even more powerful overlap possibilities are available with bedmap.

# grab full regions in A.bed that overlap C.bed and D.bed by 3 bp or more, and do not overlap B.bed by 3 bp or more

bedops -i C.bed D.bed \
  | bedops -e 3 A.bed - \
  | bedops -n 3 - B.bed

Hope that helps.

sjneph commented 5 years ago

a few typos and errors - updated on github

demis001 commented 5 years ago

@sjneph,

Thank you very much for detailed response! Sure, I will post in the forum in the future.

One more question: What is the best option to get a region that overlap 80% in 3 bed files. "-i" doesn't accept a percentage overlap. The "-m" option extend the region. I scanned the genome for specific pattern in three tissues and want to know if these region overlap. All bed are sorted.

I did the following lines but using the merge and union as input and count if the region exist in the three tissues. The problem with this approach is, it extended the region while merging the three bed.

awk '{ print $1"\t"$2"\t"$3"\t"FILENAME }' A.bed B.bed C.bed | sort-bed - > union.bed
 bedops -m A.bed B.bed C.bed > merge.bed
bedmap --echo --echo-map-id-uniq --bp-ovr 10 --delim "\t" merge..bed union.bed | awk '{ print $1"\t"$2"\t"$3"\t"split($4, a, ";"); }' > result.bed

By the way: I want to extract the locus that are common in all three bed files (tissues) at this stage.

Thank you very much!

sjneph commented 5 years ago

It's an interesting challenge. I'll try to look at it in detail tomorrow.

sjneph commented 5 years ago

Try this:

bedops -s A.bed B.bed C.bed \
  | bedops -d union.bed - \
  | bedmap --fraction-ref 0.8 --skip-unmapped --echo-map-range union.bed - \
  | sort-bed - \
  | bedmap --fraction-map 0.8 --skip-unmapped --echo --echo-map-id-uniq --delim "\t" - union.bed \
  | awk 'split($4,a,";")==3' \
  | bedops -m -

Something close to this should work. The first line finds bases that are specific to only 1 input file. The second line removes those from consideration. Everything remaining intersects at least 2 input files. The third line grabs the subset of union.bed that is covered 80% or more by remaining bases, and outputs the genomic extent of those bases. These are not guaranteed to be in sort-bed order, so we next sort. The fifth line maps on names from union.bed that are 80% or more covered by the genomic extents, and then awk selects those genomic extents that fit your requirements. The last piece merges over the genomic extents for the final answer, effectively selecting the longest genomic extents that meet the requirements. I did not run this, but do expect that it will be close to the solution you're after.

sjneph commented 5 years ago

some mistakes in original post, corrected on github.