gbif / pipelines

Pipelines for data processing (GBIF and LivingAtlases)
Apache License 2.0
40 stars 28 forks source link

K8s: Large downloads contain duplicate data #1053

Closed muttcg closed 2 months ago

muttcg commented 2 months ago

The download - querying the result file returns the following info:

WITH
    dc AS
    (
        SELECT countDistinct(occ.gbifID) AS dc_count
        FROM file('occurrence.txt', 'TSVWithNames') AS occ
    ),
    rc AS
    (
        SELECT count(occ.gbifID) AS rc_count
        FROM file('occurrence.txt', 'TSVWithNames') AS occ
    )
SELECT
    dc.dc_count,
    rc.rc_count,
    rc.rc_count - dc.dc_count AS diff
FROM dc, rc

Query id: fa91002d-4819-4717-abc4-c36f111a9a22

┌─dc_count─┬─rc_count─┬─diff─┐
│ 11848234 │ 11850147 │ 1913 │
└──────────┴──────────┴──────┘
_______________________________________________________________________________

SELECT
    occ.gbifID,
    count(occ.gbifID) AS dc_count
FROM file('occurrence.txt', 'TSVWithNames') AS occ
GROUP BY occ.gbifID
ORDER BY dc_count DESC
LIMIT 10

Query id: 240224b9-c38c-40fc-be89-8d73601f0e86

┌──gbifID─┬─dc_count─┐
│  305853 │        2 │
│  341511 │        2 │
│  366508 │        2 │
│  920977 │        2 │
│  320061 │        2 │
│  385658 │        2 │
│   17720 │        2 │
│ 1092156 │        2 │
│  343964 │        2 │
│  399727 │        2 │
└─────────┴──────────┘
fmendezh commented 2 months ago

it looks that the problem is somewhere else:

select gbifid, datasetkey from dev2.occurrence where gbifID = '285465';
 gbifid |              datasetkey
--------+--------------------------------------
 285465 | b4ae1720-1431-49ee-bfeb-8146fc42b1a3
 285465 | 876c11ed-b8a6-4b4b-868f-da9b91ab330c
muttcg commented 2 months ago

Looks like the issue appeared after dropping hbase key lookup table