adamewing / tldr

Identify and annotate TE-mediated insertions in long-read sequence data
MIT License

Possible issue with sort command prior to tabix compression #1

Closed Shians closed 3 years ago

Shians commented 3 years ago

I am working on a package to visualise differential methylation results, and I came across this repo because of its use of tabix. I don't do this type of analysis, so I haven't run the code to determine whether this is a true issue. Here you are sorting with -k3,3n, but this only sorts the nanopolish output by start position. If the data contained multiple chromosomes, this would not be sufficient for tabix indexing, which requires that all records for each chromosome be grouped together. e.g.

chr11   -   6315330 6315330 8a772156-8640-4ae6-aadf-96df89f70eaa    -13.58  -159.06 -145.48 1   1   AACCCCGAGTT
chr11   -   6316799 6316799 8a772156-8640-4ae6-aadf-96df89f70eaa    -1.47   -86.68  -85.20  1   1   CCTCTCGGAAT
chr7    -   6730754 6730754 08fe6e80-6afb-473d-9b3d-498c10938416    -4.30   -84.01  -79.72  1   1   AAAAACGACAA
chr17   -   6802333 6802333 61bf70fa-980b-4127-9dd8-020cdac30efd    4.63    -82.03  -86.65  1   1   TGCCACGTGGA
chr17   -   6813581 6813581 61bf70fa-980b-4127-9dd8-020cdac30efd    -1.25   -102.94 -101.69 1   1   GAAAACGGACT
chr7    -   7278186 7278186 fe307f75-eafe-4ef9-9316-80f8cedc48ae    0.14    -138.77 -138.90 1   1   GCCTTCGGGTC
chr7    +   7290209 7290225 c0dcf7ee-4d93-4e4b-86cb-d85afbc6fb4c    0.83    -211.52 -212.35 1   5   GTGTGCGTGTGCGAGCGCTCGCGTATG
chr9    -   7341330 7341330 6802481c-2d53-44eb-a3b9-c93a152cb255    -0.19   -173.57 -173.38 1   1   ACAATCGACTG

is sorted by start position but not by chromosome, so attempting to run tabix -f -S 1 -s 1 -b 3 -e 4 on it should raise the error [E::hts_idx_push] Chromosome blocks not continuous. You can fix this by using sort -k1,1V -k3,3n, which sorts first by chromosome and then by start position. If the pipeline runs in such a way that the nanopolish output only ever contains one chromosome, then this is irrelevant.
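The grouping requirement can be illustrated with a minimal Python sketch. It does not call tabix and is not code from this repo; the helper names (`natural_key`, `blocks_contiguous`, `sort_for_tabix`) are hypothetical, and the natural-order key only approximates what `sort -k1,1V` does. The rows reuse the chromosome, strand, start and end columns from the example above.

```python
import re

def natural_key(chrom):
    # Split digit runs out of the name so "chr7" orders before "chr11"
    # (an approximation of GNU sort's -V ordering, assumed here).
    return [int(p) if p.isdigit() else p for p in re.split(r"(\d+)", chrom)]

def blocks_contiguous(rows):
    # True if every chromosome appears as one uninterrupted block,
    # which is the property tabix indexing relies on.
    seen = set()
    prev = None
    for row in rows:
        chrom = row[0]
        if chrom != prev:
            if chrom in seen:
                return False  # chromosome reappears after another one
            seen.add(chrom)
            prev = chrom
    return True

def sort_for_tabix(rows):
    # Sort by chromosome first, then by start position (column 3).
    return sorted(rows, key=lambda r: (natural_key(r[0]), r[2]))

# (chromosome, strand, start, end) taken from the example rows above,
# already ordered by start position only.
rows = [
    ("chr11", "-", 6315330, 6315330),
    ("chr11", "-", 6316799, 6316799),
    ("chr7", "-", 6730754, 6730754),
    ("chr17", "-", 6802333, 6802333),
    ("chr17", "-", 6813581, 6813581),
    ("chr7", "-", 7278186, 7278186),
    ("chr7", "+", 7290209, 7290225),
    ("chr9", "-", 7341330, 7341330),
]

print(blocks_contiguous(rows))                  # start-sorted only
print(blocks_contiguous(sort_for_tabix(rows)))  # chromosome, then start
```

In the start-sorted input, chr7 reappears after chr17, which is the situation the tabix error message describes; after sorting by chromosome and then start, each chromosome forms a single block.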

adamewing commented 3 years ago

Thanks for your interest. You're correct that in the general case the input would need to be sorted by chromosome for tabix to work, but in this case we're looking at events (insertions) that can only span a segment of a single chromosome, so this sort is sufficient.