EGA-archive / beacon2-ri-api

Beacon v2 Reference Implementation (API)
Apache License 2.0

Query individuals by sex returns wrong results #306

Closed AlexCork1 closed 5 months ago

AlexCork1 commented 5 months ago

I think there are bugs in querying individuals by sex. For example, if I filter on "NCIT:C16576" (female), the result set contains males as well. The female count is also too low.

Steps to reproduce:

  1. Follow these steps: https://github.com/EGA-archive/beacon2-ri-api/tree/master/deploy
  2. Run this query: http GET http://localhost:5050/api/individuals?filters=NCIT:C16576 (to get only women, NCIT:C16576)
  3. In the response summary, numTotalResults is 15, but counting the matching records in individuals.json gives 1271.
  4. resultSets contains 10 results: 5 with "sex": {"id": "NCIT:C16576", "label": "female"} and 5 with "sex": {"id": "NCIT:C20197", "label": "male"}.
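
To cross-check the expected count independently of the API, the records in individuals.json can be tallied by sex id. This is a minimal sketch, assuming the file is a JSON array of individual documents with a "sex": {"id": ...} field as in step 4; the function name `count_by_sex` is illustrative and not part of beacon2-ri-api.

```python
from collections import Counter


def count_by_sex(individuals):
    """Tally individual records by their sex ontology id (e.g. NCIT:C16576)."""
    counts = Counter()
    for record in individuals:
        # Records without a "sex" field are grouped under None.
        sex_id = (record.get("sex") or {}).get("id")
        counts[sex_id] += 1
    return counts
```

To use it, load the file with `json.load(open("individuals.json"))` (the path is an assumption) and compare `count_by_sex(...)["NCIT:C16576"]` with numTotalResults.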

costero-e commented 5 months ago

Hi @AlexCork1, thank you for your report. First, please update your beacon to the latest version; this issue may come from a bug in an older version of the API. Second, the CINECA dataset should indeed return 1271 results, and that is what numTotalResults should show if you have only one dataset. Remember to run the reindex.py script after you load all the data (the error may come from there). Third, your query should look like this:

curl \
  -H 'Content-Type: application/json' \
  -X POST \
  -d '{
    "meta": {
        "apiVersion": "2.0"
    },
    "query": {
        "requestParameters": {},
        "filters": [
            {"id": "NCIT:C16576", "scope": "individual"}
        ],
        "includeResultsetResponses": "HIT",
        "pagination": {
            "skip": 0,
            "limit": 10
        },
        "testMode": false,
        "requestedGranularity": "record"
    }
}' \
  http://localhost:5050/api/individuals

Please update the beacon container to the latest version of the GitHub master branch, try again, and tell me whether this solves the issue for you.
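
The same POST request can also be sent from Python using only the standard library. This is a sketch under assumptions: the helper names `build_individuals_query` and `post_query` are invented for illustration, and the body simply mirrors the curl payload above.

```python
import json
import urllib.request


def build_individuals_query(filter_id, scope="individual", skip=0, limit=10):
    """Build the JSON body for a POST to /api/individuals with one filter."""
    return {
        "meta": {"apiVersion": "2.0"},
        "query": {
            "requestParameters": {},
            "filters": [{"id": filter_id, "scope": scope}],
            "includeResultsetResponses": "HIT",
            "pagination": {"skip": skip, "limit": limit},
            "testMode": False,
            "requestedGranularity": "record",
        },
    }


def post_query(url, payload):
    """POST the payload as JSON and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

For example, `post_query("http://localhost:5050/api/individuals", build_individuals_query("NCIT:C16576"))` would send the query for female individuals to a local deployment.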

Thanks,

Oriol

albodrug commented 5 months ago

Hello,

I have the same issue when sending calls to biosamples or individuals. The API always returns 15 in numTotalResults.

I have a test dataset of 5 biosamples and 5 individuals only, so I knew it was not possible to get 15...

I went looking through the functions, and it looks like get_count() in beacon/db/utils.py has an exception handler that returns 15 when the count fails (lines 72 to 74 in utils.py).
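
The general pattern can be illustrated without the real code. The following is a hypothetical sketch, not the actual get_count() from beacon/db/utils.py: it shows how a hard-coded fallback hides a failed count, and one way to make the failure visible instead. All names are invented for illustration.

```python
FALLBACK_COUNT = 15  # the placeholder value seen in the responses


def count_with_fallback(count_fn):
    """Return count_fn(), or a fixed placeholder if counting fails."""
    try:
        return count_fn()
    except Exception:
        # The caller cannot tell this apart from a genuine count of 15.
        return FALLBACK_COUNT


def count_or_none(count_fn):
    """Return count_fn(), or None so that callers can detect the failure."""
    try:
        return count_fn()
    except Exception:
        return None


def failing_count():
    """Stand-in for a count that cannot be computed."""
    raise RuntimeError("count unavailable")
```

With the first variant, a dataset of 5 records and a broken count both produce a plausible-looking number; with the second, the caller has to decide explicitly what to report.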

I tried to debug it but I can't figure out why exactly the count does not work. I would also appreciate any input!

I saw no issues with calls to g_variants : )

beacon             | [beacon.request.handlers][DEBUG ] (L53) 10
beacon             | [beacon.request.handlers][DEBUG ] (L54) meta=RequestMeta(requested_schemas=[], api_version='v2.0.0') query=RequestQuery(filters=[], include_resultset_responses=<IncludeResultsetResponses.HIT: 'HIT'>, pagination=Pagination(skip=0, limit=10), request_parameters={'filters': 'NCIT:C16576'}, test_mode=False, requested_granularity=<Granularity.RECORD: 'record'>, scope=None)
beacon             | [beacon.request.handlers][DEBUG ] (L59) None
beacon             | [beacon.request.handlers][DEBUG ] (L69) public
beacon-permissions | public
beacon-permissions | visa_datasets: []
beacon-permissions | ['ICAN_DATASET_3K']
beacon             | [beacon.request.handlers][DEBUG ] (L75) ['ICAN_DATASET_3K']
beacon             | [beacon.request.handlers][DEBUG ] (L76) public
beacon             | [beacon.request.handlers][DEBUG ] (L78) all datasets:  [['ICAN_DATASET_3K'], []]
beacon             | [beacon.request.handlers][ INFO ] (L79) resolved datasets:  ['ICAN_DATASET_3K']
beacon             | [beacon.request.handlers][DEBUG ] (L80) True
beacon             | [beacon.request.handlers][DEBUG ] (L81) []
beacon             | [beacon.db.utils][DEBUG ] (L42) Returning estimated count
beacon             | [beacon.db.utils][DEBUG ] (L83) FINAL QUERY: {}
beacon             | [beacon.db.utils][DEBUG ] (L84) 0
beacon             | [beacon.request.handlers][DEBUG ] (L107) ['ICAN_DATASET_3K']
beacon             | [beacon.request.handlers][DEBUG ] (L110) []
beacon             | [beacon.request.handlers][DEBUG ] (L111) ['ICAN_DATASET_3K']
beacon             | [beacon.request.handlers][DEBUG ] (L146) ['ICAN_DATASET_3K']
beacon             | [beacon.request.handlers][DEBUG ] (L149) ICAN_DATASET_3K
beacon             | [beacon.db.individuals][ INFO ] (L208) {'_id': ObjectId('6616b767421532a7970d0b39'), 'biosampleId': 'B00GWDY', 'geographicOrigin': {'id': 'NCIT:C16592', 'label': 'France'}, 'id': 'i45716', 'sex': {'id': 'NCIT:C16576', 'label': 'female'}}
beacon             | [beacon.db.individuals][ INFO ] (L208) {'_id': ObjectId('6616b767421532a7970d0b3a'), 'biosampleId': 'B00GWE2', 'geographicOrigin': {'id': 'NCIT:C16592', 'label': 'France'}, 'id': 'i46727', 'sex': {'id': 'NCIT:C20197', 'label': 'male'}}
beacon             | [beacon.db.individuals][ INFO ] (L208) {'_id': ObjectId('6616b767421532a7970d0b3b'), 'biosampleId': 'B00GWE0', 'geographicOrigin': {'id': 'NCIT:C16592', 'label': 'France'}, 'id': 'i46385', 'sex': {'id': 'NCIT:C16576', 'label': 'female'}}
beacon             | [beacon.db.individuals][ INFO ] (L208) {'_id': ObjectId('6616b767421532a7970d0b3c'), 'biosampleId': 'B00GWE1', 'geographicOrigin': {'id': 'NCIT:C16592', 'label': 'France'}, 'id': 'i46527', 'sex': {'id': 'NCIT:C20197', 'label': 'male'}}
beacon             | [beacon.db.individuals][ INFO ] (L208) {'_id': ObjectId('6616b767421532a7970d0b3d'), 'biosampleId': 'B00GWDZ', 'geographicOrigin': {'id': 'NCIT:C16592', 'label': 'France'}, 'id': 'i42629', 'sex': {'id': 'NCIT:C16576', 'label': 'female'}}
beacon             | [beacon.db.individuals][DEBUG ] (L211) {'$and': []}
beacon             | [beacon.db.individuals][DEBUG ] (L212) True
beacon             | [beacon.db.filters][DEBUG ] (L19) {'$and': []}
beacon             | [beacon.db.filters][DEBUG ] (L22) {}
beacon             | [beacon.db.filters][DEBUG ] (L256) {'$and': [{'id': 'NCIT:C16576'}, {'scope': None}]}
beacon             | [beacon.db.utils][DEBUG ] (L83) FINAL QUERY: {'$and': [{'id': 'NCIT:C16576'}, {'scope': None}]}
beacon             | [beacon.db.utils][DEBUG ] (L84) 0
beacon             | [beacon.db.utils][DEBUG ] (L83) FINAL QUERY: {'$and': [{'id': {'$regex': ''}}, {'scope': None}]}
beacon             | [beacon.db.utils][DEBUG ] (L84) 0
beacon             | [beacon.db.filters][DEBUG ] (L299) {'$or': [{'.id': 'NCIT:C16576'}]}
beacon             | [beacon.db.filters][DEBUG ] (L256) {'$and': [{'id': 'NCIT:C16576'}, {'scope': None}]}
beacon             | [beacon.db.utils][DEBUG ] (L83) FINAL QUERY: {'$and': [{'id': 'NCIT:C16576'}, {'scope': None}]}
beacon             | [beacon.db.utils][DEBUG ] (L84) 0
beacon             | [beacon.db.utils][DEBUG ] (L83) FINAL QUERY: {'$and': [{'id': {'$regex': ''}}, {'scope': None}]}
beacon             | [beacon.db.utils][DEBUG ] (L84) 0
beacon             | [beacon.db.filters][DEBUG ] (L299) {'$or': [{'.id': 'NCIT:C16576'}]}
beacon             | [beacon.db.filters][DEBUG ] (L149) {'$and': [{'$and': []}, {'$or': [{'.id': 'NCIT:C16576'}]}, {'$or': [{'.id': 'NCIT:C16576'}]}]}
beacon             | [beacon.db.individuals][DEBUG ] (L33) Include Resultset Responses = HIT
beacon             | [beacon.db.utils][DEBUG ] (L197) {'$and': [{'$and': []}, {'$or': [{'.id': 'NCIT:C16576'}]}, {'$or': [{'.id': 'NCIT:C16576'}]}]}
beacon             | [beacon.db.utils][DEBUG ] (L198) 0
beacon             | [beacon.db.utils][DEBUG ] (L216) {'$and': [{'$and': []}, {'$or': [{'.id': 'NCIT:C16576'}]}, {'$or': [{'.id': 'NCIT:C16576'}]}], '$or': [{'id': 'B00GWDY'}, {'id': 'B00GWDZ'}, {'id': 'B00GWE0'}, {'id': 'B00GWE1'}, {'id': 'B00GWE2'}, {'id': 'i45716'}, {'id': 'i42629'}, {'id': 'i46385'}, {'id': 'i46527'}, {'id': 'i46727'}]}
beacon             | [beacon.db.utils][ INFO ] (L46) <pymongo.cursor.Cursor object at 0x710977f524d0>
beacon             | [beacon.db.utils][ INFO ] (L47) {'$and': [{'$and': []}, {'$or': [{'.id': 'NCIT:C16576'}]}, {'$or': [{'.id': 'NCIT:C16576'}]}], '$or': [{'id': 'B00GWDY'}, {'id': 'B00GWDZ'}, {'id': 'B00GWE0'}, {'id': 'B00GWE1'}, {'id': 'B00GWE2'}, {'id': 'i45716'}, {'id': 'i42629'}, {'id': 'i46385'}, {'id': 'i46527'}, {'id': 'i46727'}]}
beacon             | [beacon.db.utils][ INFO ] (L48) Collection(Database(MongoClient(host=['mongo:27017'], document_class=dict, tz_aware=False, connect=True, authsource='admin'), 'beacon'), 'individuals')
beacon             | [beacon.db.utils][DEBUG ] (L218) 15
beacon             | [beacon.db.utils][DEBUG ] (L219) 10
beacon             | [beacon.db.utils][DEBUG ] (L83) FINAL QUERY: {'$and': [{'$and': []}, {'$or': [{'.id': 'NCIT:C16576'}]}, {'$or': [{'.id': 'NCIT:C16576'}]}], '$or': [{'id': 'B00GWDY'}, {'id': 'B00GWDZ'}, {'id': 'B00GWE0'}, {'id': 'B00GWE1'}, {'id': 'B00GWE2'}, {'id': 'i45716'}, {'id': 'i42629'}, {'id': 'i46385'}, {'id': 'i46527'}, {'id': 'i46727'}]}
beacon             | [beacon.db.utils][DEBUG ] (L84) 0
beacon             | [beacon.request.handlers][DEBUG ] (L165) 15
beacon             | [beacon.request.handlers][DEBUG ] (L169) record
beacon             | [beacon.response.build_response][DEBUG ] (L68) 15
beacon             | [beacon.db.utils][DEBUG ] (L83) FINAL QUERY: {'id': 'ICAN_DATASET_3K'}
beacon             | [beacon.db.utils][DEBUG ] (L84) 0
beacon             | [beacon.db.utils][DEBUG ] (L83) FINAL QUERY: {}
beacon             | [beacon.db.utils][DEBUG ] (L84) 0
beacon             | [beacon.utils.stream][DEBUG ] (L25) HTTP response stream
beacon             | [beacon.utils.stream][DEBUG ] (L30) Partial content: False
beacon             | [aiohttp.access][ INFO ] (L206) 192.168.16.1 [11/Apr/2024:08:35:38 +0000] "GET /api/individuals?filters=NCIT:C16576 HTTP/1.1" 200 1219 "-" "HTTPie/2.4.0"
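
In the log above, the GET request is parsed with filters=[] and request_parameters={'filters': 'NCIT:C16576'}, the filter's scope is None, and the generated clause targets the field '.id' (nothing before the dot). One reading is that the query-string filter never gets a usable scope. Below is a minimal sketch of turning a `filters` query-string parameter into the filter objects used in the POST body; `parse_get_filters` is a hypothetical helper, and treating the value as a comma-separated list with a default scope is an assumption, not the project's implementation.

```python
from urllib.parse import parse_qs, urlsplit


def parse_get_filters(url, default_scope="individual"):
    """Extract ?filters=A,B from a URL as a list of filter dicts."""
    query = parse_qs(urlsplit(url).query)
    filters = []
    for raw_value in query.get("filters", []):
        for term in raw_value.split(","):
            term = term.strip()
            if term:
                filters.append({"id": term, "scope": default_scope})
    return filters
```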

Thanks! Alex

costero-e commented 5 months ago

Hi @albodrug, at the moment beacon returns 15 when it can't insert a count into the mongo counts collection. I have to change that. However, beacon should return a correct count if all the deployment steps were executed correctly. Please check whether a counts collection exists in your mongo. You can use mongo-express for this if you wish; it is available at http://localhost:8081. If the counts collection exists, please make sure you have executed this script:

docker exec beacon python beacon/reindex.py

Run it again and then retry the query exactly as I pasted it in my reply to AlexCork1. Tell me if this solves your issue. Thank you, Oriol
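
The check for the counts collection can also be scripted instead of done through the web UI. This is a sketch, assuming `db` behaves like a pymongo Database (it only uses `list_collection_names()` and `count_documents()`); the function name is illustrative.

```python
def counts_collection_status(db):
    """Report whether `db` has a `counts` collection and how many documents it holds."""
    if "counts" not in db.list_collection_names():
        return {"exists": False, "documents": 0}
    return {"exists": True, "documents": db["counts"].count_documents({})}
```

With pymongo installed, `db` could be obtained with something like `MongoClient(uri)["beacon"]`. The database name `beacon` appears in the log earlier in this thread; the connection URI and credentials depend on your deployment.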

albodrug commented 5 months ago

Hi @costero-e

When using the curl command you pasted, I get no results. Re-running the reindex.py empties my count table in mongo and does not solve the issue.

(labaz) bodrug-a@pp-irs1-ylt:~$ sudo docker exec beacon python beacon/reindex.py
(labaz) bodrug-a@pp-irs1-ylt:~$ curl   -H 'Content-Type: application/json'   -X POST   -d '{
    "meta": {
        "apiVersion": "2.0"
    },
    "query":{ "requestParameters": {
        },
        "filters": [
{"id":"NCIT:C16576", "scope":"individual"} ],
        "includeResultsetResponses": "HIT",
        "pagination": {
            "skip": 0,
            "limit": 10
        },
        "testMode": false,
        "requestedGranularity": "record"
    }
}'   http://localhost:5050/api/individuals | python -m json.tool
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1064    0   697  100   367  63363  33363 --:--:-- --:--:-- --:--:-- 96727
{
    "meta": {
        "beaconId": "org.ega-archive.ga4gh-approval-beacon-test",
        "apiVersion": "v2.0.0",
        "returnedGranularity": "record",
        "receivedRequestSummary": {
            "apiVersion": "2.0",
            "requestedSchemas": [],
            "filters": [
                "NCIT:C16576",
                "individual"
            ],
            "requestParameters": {},
            "includeResultsetResponses": "HIT",
            "pagination": {
                "skip": 0,
                "limit": 10
            },
            "requestedGranularity": "record",
            "testMode": false
        },
        "returnedSchemas": [
            {
                "entityType": "individual",
                "schema": "beacon-individual-v2.0.0"
            }
        ]
    },
    "responseSummary": {
        "exists": false
    },
    "response": {
        "resultSets": []
    },
    "beaconHandovers": [
        [
            {
                "handoverType": {
                    "id": "CUSTOM:000001",
                    "label": "Project description"
                },
                "note": "Project description",
                "url": "https://www.nist.gov/programs-projects/genome-bottle"
            }
        ]
    ]
}

The GET request returns 15.

(labaz) bodrug-a@pp-irs1-ylt:~$ http GET http://localhost:5050/api/individuals?filters=NCIT:C16576 
HTTP/1.1 200 OK
Content-Type: application/json;charset=utf-8
Date: Thu, 11 Apr 2024 10:04:36 GMT
Server: GA4GH Approval Beacon Test v2.0 (based on Python/3.10 aiohttp/3.8.1)
Transfer-Encoding: chunked

{
    "beaconHandovers": [
        [
            {
                "handoverType": {
                    "id": "CUSTOM:000001",
                    "label": "Project description"
                },
                "note": "Project description",
                "url": "https://www.nist.gov/programs-projects/genome-bottle"
            }
        ]
    ],
    "meta": {
        "apiVersion": "v2.0.0",
        "beaconId": "org.ega-archive.ga4gh-approval-beacon-test",
        "receivedRequestSummary": {
            "apiVersion": "v2.0.0",
            "filters": [
                "NCIT:C16576"
            ],
            "includeResultsetResponses": "HIT",
            "pagination": {
                "limit": 10,
                "skip": 0
            },
            "requestParameters": {
                "filters": "NCIT:C16576"
            },
            "requestedGranularity": "record",
            "requestedSchemas": [],
            "testMode": false
        },
        "returnedGranularity": "record",
        "returnedSchemas": [
            {
                "entityType": "individual",
                "schema": "beacon-individual-v2.0.0"
            }
        ]
    },
    "response": {
        "resultSets": [
            {
                "exists": true,
                "id": "ICAN_DATASET_3K",
                "results": [],
                "resultsCount": 15,
                "resultsHandover": [
                    {
                        "handoverType": {
                            "id": "CUSTOM:000001",
                            "label": "Project description"
                        },
                        "note": "Project description",
                        "url": "https://www.nist.gov/programs-projects/genome-bottle"
                    }
                ],
                "setType": "dataset"
            }
        ]
    },
    "responseSummary": {
        "exists": true,
        "numTotalResults": 15
    }
}

Thanks, Alex

costero-e commented 5 months ago

Hi @albodrug, thanks for your reply. First of all, I see that GET requests with filters are not working properly in the latest version I pushed; I will fix that, sorry. POST requests, on the other hand, do work properly, but your POST request returns no results. This can happen when no data matches the filtering term you are applying, when the dataset is not listed in public_datasets.yml, or when the relationship between the ids and their datasets is missing. To show that the query works, here is an example I just ran against the CINECA synthetic dataset, with the query and the response I get:

curl \
  -H 'Content-Type: application/json' \
  -X POST \
  -d '{
    "meta": {
        "apiVersion": "2.0"
    },
    "query": {
        "requestParameters": {},
        "filters": [
            {"id": "NCIT:C16576", "scope": "individual"}
        ],
        "includeResultsetResponses": "HIT",
        "pagination": {
            "skip": 0,
            "limit": 1 
        },
        "testMode": false,
        "requestedGranularity": "record"
    }
}' \
  http://localhost:5050/api/individuals
{"meta":{"beaconId":"org.ega-archive.ga4gh-approval-beacon-test","apiVersion":"v2.0.0","returnedGranularity":"record","receivedRequestSummary":{"apiVersion":"2.0","requestedSchemas":[],"filters":["NCIT:C16576","individual"],"requestParameters":{},"includeResultsetResponses":"HIT","pagination":{"skip":0,"limit":1},"requestedGranularity":"record","testMode":false},"returnedSchemas":[{"entityType":"individual","schema":"beacon-individual-v2.0.0"}]},"responseSummary":{"exists":true,"numTotalResults":1271},"response":{"resultSets":[{"id":"CINECA_synthetic_cohort_EUROPE_UK1","setType":"dataset","exists":true,"resultsCount":1271,"results":[{"_id":"6616894514c916aeda0fa156","ethnicity":{"id":"NCIT:C67109","label":"White and Asian"},"id":"HG00100","interventionsOrProcedures":[{"procedureCode":{"id":"OPCS4:T77.2","label":"OPCS(v4-0.0):Wide excision of muscle"}}],"measures":[{"assayCode":{"id":"LOINC:35925-4","label":"BMI"},"date":"2021-09-24","measurementValue":{"unit":{"id":"NCIT:C49671","label":"Kilogram per Square Meter"},"value":28.27885509}},{"assayCode":{"id":"LOINC:3141-9","label":"Weight"},"date":"2021-09-24","measurementValue":{"unit":{"id":"NCIT:C28252","label":"Kilogram"},"value":74.4885}},{"assayCode":{"id":"LOINC:8308-9","label":"Height-standing"},"date":"2021-09-24","measurementValue":{"unit":{"id":"NCIT:C49668","label":"Centimeter"},"value":162.2982}}],"sex":{"id":"NCIT:C16576","label":"female"}}],"resultsHandover":{"handoverType":{"id":"CUSTOM:000001","label":"Project description"},"note":"Project description","url":"https://www.nist.gov/programs-projects/genome-bottle"}}]},"beaconHandovers":[{"handoverType":{"id":"NCIT:C189151","label":"Study Data Repository"},"note":"Colorectal Adenocarcinoma TCGA PanCancer data. The original data is <a href=\"https://gdc.cancer.gov/about-data/publications/pancanatlas\">here</a>. 
The publications are <a href=\"https://www.cell.com/pb-assets/consortium/pancanceratlas/pancani3/index.html\">here</a>.","url":"https://github.com/cBioPortal/datahub/tree/master/public/coadread_tcga_pan_can_atlas_2018"},{"handoverType":{"id":"CUSTOM:000001","label":"Project description"},"note":"Project description","url":"https://www.nist.gov/programs-projects/genome-bottle"}]}

Another possible reason is that the filtering terms script has not been executed. Please make sure everything is set up as described in the deployment instructions and try again. In the meantime, I will fix GET requests with filters so you can try a GET as well. I will tell you when it is fixed.

Thanks, Oriol

costero-e commented 5 months ago

Hi @albodrug, GET requests with filters (only for individuals for now) should be fixed. If you can try it and tell me the outcome, I would appreciate it. Thank you, Oriol

albodrug commented 5 months ago

The GET requests work on the CINECA data after a git pull.

(labaz) bodrug-a@pp-irs1-ylt:~$ http GET http://localhost:5050/api/individuals?filters=NCIT:C16576 | python -m json.tool | grep numTotalResults
        "numTotalResults": 1271

I still have issues with my own data, I do execute the filtering and index scripts after data loading though... I have a bash script to load the data that finishes with:

sudo docker exec beacon python beacon/reindex.py
sudo docker exec beacon python beacon/db/extract_filtering_terms.py
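
A small wrapper can make a failed step visible instead of letting the load script continue silently. This is a sketch only: the command lists repeat the two lines above (without sudo, which is an assumption about the environment), and `run_post_load` is an invented name.

```python
import subprocess

# The two post-load steps quoted above, in the same order.
POST_LOAD_COMMANDS = [
    ["docker", "exec", "beacon", "python", "beacon/reindex.py"],
    ["docker", "exec", "beacon", "python", "beacon/db/extract_filtering_terms.py"],
]


def run_post_load(commands=POST_LOAD_COMMANDS, runner=subprocess.run):
    """Run each command in order; return the first command that fails, or None."""
    for command in commands:
        result = runner(command)
        if result.returncode != 0:
            return command  # stop here: this step exited with an error
    return None  # every step exited with code 0
```

The `runner` parameter exists so the function can be exercised without Docker; by default it calls `subprocess.run`.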

I will check the configs and ymls more thoroughly as suggested.

Thanks a lot for your help and patience.

Alex

costero-e commented 5 months ago

Hi @albodrug, no problem. Thank you for reporting issues and testing beacon RI. The script looks like it does what is needed, so I think the issue may come from the .yml files. Please list all the ids (biosamples and individuals) in the dataset entry of the datasets.yml file, using the exact names (for both the dataset and the ids), and be aware of case sensitivity. If you want to paste the contents of your .yml files here, maybe I can help. After modifying the .yml files, rebuild the beacon container (to rule out a container that was not refreshed). Also, bear in mind that you need a datasets.json containing a document whose id is exactly the dataset name you use in the .yml files. I'm here to help with beacon RI, so no worries, keep asking about any issue you have.
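
These consistency rules can be checked mechanically before rebuilding the container. The sketch below is hypothetical: the layout of datasets.yml is not shown in this thread, so the function takes the ids already extracted from it rather than parsing the file, and `check_dataset_config` is an invented name.

```python
import json


def check_dataset_config(dataset_name, yml_ids, datasets_json_text, record_ids):
    """Return a list of problems found; an empty list means no mismatch was detected."""
    try:
        documents = json.loads(datasets_json_text)
    except json.JSONDecodeError as error:
        return [f"datasets.json is not valid JSON: {error}"]
    if isinstance(documents, dict):
        documents = [documents]  # accept a single document as well as a list
    problems = []
    # Comparisons are exact and therefore case-sensitive.
    if not any(isinstance(doc, dict) and doc.get("id") == dataset_name for doc in documents):
        problems.append(f"no document with id {dataset_name!r} in datasets.json")
    missing = sorted(set(record_ids) - set(yml_ids))
    if missing:
        problems.append(f"ids not listed under {dataset_name!r}: {missing}")
    return problems
```

Here `yml_ids` stands for the ids listed under the dataset entry, `record_ids` for the biosample and individual ids actually loaded, and `datasets_json_text` for the raw contents of datasets.json.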

Best, Oriol

albodrug commented 5 months ago

@costero-e , my issue was due to a badly formatted datasets.json file. thanks for all the tips.

looking forward to the fix on biosamples count as well.

Bye, Alex

costero-e commented 5 months ago

Biosamples filters for get requests are working. The issue was only for individuals and g_variants.

Best, Oriol

AlexCork1 commented 5 months ago

Just a comment from my side. POST requests worked like a charm and I am getting 1271 as numTotalResults.

Unfortunately, the GET request on my side is still not working correctly (I cloned the repository yesterday evening). The response I get has "numTotalResults": 15 and "results": []. Here is the shortened response:

{
    "meta": {
        ...
        "receivedRequestSummary": {
            "apiVersion": "v2.0.0",
            "requestedSchemas": [],
            "filters": [
                "NCIT:C16576"
            ],
            "requestParameters": {
                "filters": "NCIT:C16576"
            },
            "includeResultsetResponses": "HIT",
            "pagination": {
                "skip": 0,
                "limit": 10
            },
            "requestedGranularity": "record",
            "testMode": false
        },
        "returnedSchemas": [
            {
                "entityType": "individual",
                "schema": "beacon-individual-v2.0.0"
            }
        ]
    },
    "responseSummary": {
        "exists": true,
        "numTotalResults": 15
    },
    "response": {
        "resultSets": [
            {
                "id": "CINECA_synthetic_cohort_EUROPE_UK1",
                "setType": "dataset",
                "exists": true,
                "resultsCount": 15,
                "results": [],
                "resultsHandover": {
                    "handoverType": {
                        "id": "CUSTOM:000001",
                        "label": "Project description"
                    },
                    "note": "Project description",
                    "url": "https://www.nist.gov/programs-projects/genome-bottle"
                }
            }
        ]
    },
    ...
}
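
A response like this one, with a positive resultsCount but an empty results list, can be flagged automatically. This is a sketch, assuming record granularity with skip 0 and a non-zero limit (otherwise an empty page can be legitimate); the function name is illustrative.

```python
def find_inconsistent_result_sets(response):
    """Return ids of result sets that report results but return none."""
    suspicious = []
    for result_set in response.get("response", {}).get("resultSets", []):
        count = result_set.get("resultsCount", 0)
        if count > 0 and not result_set.get("results"):
            suspicious.append(result_set.get("id"))
    return suspicious
```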

costero-e commented 5 months ago

Hi @AlexCork1 ! This is the issue that happened before the patch I committed yesterday. Please, make sure to git fetch and then git pull and then build the beacon container again.

docker-compose up -d --build beacon

Thanks, Oriol

AlexCork1 commented 5 months ago

Thanks! It works now :)