arq5x / bedtools

A powerful toolset for genome arithmetic.
http://code.google.com/p/bedtools/
GNU General Public License v2.0

groupBy retaining sort order on distinct operation #105

Open duartemolha opened 9 years ago

duartemolha commented 9 years ago

Hi Guys

I was using groupBy and noticed something a bit annoying, and I was wondering if you could improve it.

Assume I have a file with this content:

MPL	NM_005373.2	1
MPL	NM_005373.2	2
MPL	NM_005373.2	3
MPL	NM_005373.2	4
MPL	NM_005373.2	5
MPL	NM_005373.2	6
MPL	NM_005373.2	7
MPL	NM_005373.2	8
MPL	NM_005373.2	9
MPL	NM_005373.2	10
MPL	NM_005373.2	11
MPL	NM_005373.2	12
MPL	XM_005270874.1	1
MPL	XM_005270874.1	2
MPL	XM_005270874.1	3
MPL	XM_005270874.1	4
MPL	XM_005270874.1	5
MPL	XM_005270874.1	6
MPL	XM_005270874.1	7
MPL	XM_005270874.1	8
MPL	XM_005270874.1	9
MPL	XM_005270874.1	10
MPL	XM_005270874.1	11
MPL	XM_005270874.1	12

The operation

groupBy -g 1 -c 2,3 -o distinct,distinct

outputs:

MPL	NM_005373.2,XM_005270874.1	1,10,11,12,2,3,4,5,6,7,8,9

As you can see, even though my input is sorted alphabetically on the 2nd column and numerically on the 3rd column, the "distinct" operation does not retain that ordering and instead forces alphabetical ordering on the output, so 10, 11 and 12 end up before 2. Would you consider implementing a sorted_num_distinct operation?

Or at least retaining the input order of the numbers?
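To make the request concrete, here is a minimal Python sketch of the three behaviours using the values from column 3 above. It is not bedtools code, and the function names are hypothetical. It only illustrates that the output shown is consistent with a string (lexicographic) sort of the distinct values, and what a numeric sort or an input-order "distinct" would return instead.

```python
def distinct_lexicographic(values):
    """Distinct values sorted as strings (matches the output shown above)."""
    return sorted(set(values))


def distinct_numeric(values):
    """Distinct values sorted by integer value (the requested sorted_num_distinct)."""
    return sorted(set(values), key=int)


def distinct_input_order(values):
    """Distinct values in order of first appearance in the input."""
    # dict preserves insertion order, so this keeps the first occurrence of each value
    return list(dict.fromkeys(values))


# Column 3 of the example file: 1..12 for each of the two transcripts
column3 = [str(n) for n in range(1, 13)] * 2

print(",".join(distinct_lexicographic(column3)))  # 1,10,11,12,2,3,4,5,6,7,8,9
print(",".join(distinct_numeric(column3)))        # 1,2,3,4,5,6,7,8,9,10,11,12
print(",".join(distinct_input_order(column3)))    # 1,2,3,4,5,6,7,8,9,10,11,12
```

For this input the numeric and input-order variants give the same result because the file is already sorted numerically on column 3. They would differ on unsorted input.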

Thanks

Duarte