cannin / gsoc_2024_cbioportal_chatbot

Modify PubmedLoader to work with PubMedCentral #13

Closed cannin closed 3 months ago

cannin commented 5 months ago

Modify the loader to work with PMC.

Example URL

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=PMC3898398&retmode=xml

Use code from here to extract the text: extract_text()

https://gist.github.com/cannin/f4c1c21926a21f8a38de577ca2f0fc4c
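A minimal sketch of this step, assuming the efetch response is JATS-style XML whose article text sits in `<p>` elements under `<body>`. This is not the gist's `extract_text()`; the function names below are hypothetical and only illustrate the idea.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_pmc_url(pmcid: str) -> str:
    """Build the efetch URL for one PMC ID, matching the example URL above."""
    query = urlencode({"db": "pmc", "id": pmcid, "retmode": "xml"})
    return f"{EFETCH_URL}?{query}"


def fetch_pmc_xml(pmcid: str, timeout: float = 30.0) -> str:
    """Download the XML for one PMC ID (network call, no retries or rate limiting)."""
    with urlopen(build_pmc_url(pmcid), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def paragraphs_from_pmc_xml(xml_text: str) -> list[str]:
    """Return the non-empty <p> paragraphs found under each <body> element.

    Assumption: un-namespaced JATS tags. Paragraphs nested inside other
    paragraphs would be returned twice; a real loader may want to handle that.
    """
    root = ET.fromstring(xml_text)
    paragraphs = []
    for body in root.iter("body"):
        for p in body.iter("p"):
            # itertext() flattens inline markup such as <italic> or <xref>.
            text = " ".join("".join(p.itertext()).split())
            if text:
                paragraphs.append(text)
    return paragraphs
```

An article with no `<body>` element yields an empty list here, which is one way to detect records that have a PMCID but no usable XML text.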

XinlingWang0628 commented 5 months ago

There are 65 studies that don't have a PMID: no_pmid_list.json

cannin commented 5 months ago

Out of how many? You still have many and should proceed with those.

XinlingWang0628 commented 5 months ago

65 out of 411.

XinlingWang0628 commented 4 months ago

So far, I have loaded 250 PubMed papers using the XML loader. I found that 40 studies have no PMCID, and 28 studies have a PMCID but no XML. Do you have any questions to ask the chatbot about the PubMed papers? That would help me check the accuracy and make changes if needed.

XinlingWang0628 commented 4 months ago

I found that 18 studies have the same PMID list, and I am not sure whether this PMID data is correct. Also, some of the PMIDs in the list appear 32 times or more. The PMID list is: "29625048,29596782,29622463,29617662,29625055,29625050,29617662,30643250,32214244,29625049,29850653".
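A short sketch of how such repeats could be counted, assuming each study record is a dict whose `pmid` field is a comma-separated string like the one quoted above. The helper names are hypothetical. Note that the quoted list itself contains 29617662 twice.

```python
from collections import Counter


def split_pmids(pmid_field: str) -> list[str]:
    """Split a comma-separated PMID string into a list of trimmed IDs."""
    return [p.strip() for p in pmid_field.split(",") if p.strip()]


def duplicated_within(pmid_field: str) -> list[str]:
    """PMIDs that occur more than once inside a single study's list."""
    counts = Counter(split_pmids(pmid_field))
    return sorted(p for p, n in counts.items() if n > 1)


def pmid_usage(studies: list[dict]) -> Counter:
    """Number of studies that reference each PMID (each study counted once per PMID)."""
    usage = Counter()
    for study in studies:
        usage.update(set(split_pmids(study.get("pmid") or "")))
    return usage


shared = ("29625048,29596782,29622463,29617662,29625055,29625050,"
          "29617662,30643250,32214244,29625049,29850653")
print(duplicated_within(shared))  # ['29617662']
```

Calling `pmid_usage(studies).most_common()` on the full study list would show which PMIDs are shared by many studies.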

cannin commented 4 months ago

A simple question to verify would be: how many samples are in study X? The answer could come from LangChain OpenAPI or from the publication.
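One way this check might be automated, assuming each study's metadata record carries an `allSampleCount` field and the chatbot's reply is available as plain text. The function names are hypothetical, and the chatbot call itself is left out.

```python
import re


def numbers_in(text: str) -> list[int]:
    """Extract integers from free text, allowing thousands separators (e.g. 1,234)."""
    return [int(m.replace(",", "")) for m in re.findall(r"\d[\d,]*", text)]


def answer_matches_sample_count(answer: str, study: dict) -> bool:
    """True if the expected sample count appears anywhere in the answer text."""
    return study["allSampleCount"] in numbers_in(answer)


study = {"studyId": "gist_msk_2023", "allSampleCount": 469}
print(answer_matches_sample_count("The study contains 469 samples.", study))  # True
```

This is a loose check: it only confirms that the expected number appears in the reply, not that the reply attributes it to the right study.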

XinlingWang0628 commented 4 months ago

Hi Augustin, I found PMIDs for those 5 studies that were missing one, but I am not sure they are correct. Could you please help me double-check when you are available?
{ "name": "Gastrointestinal Stromal Tumors (MSK, Clin Cancer Res 2023)", "description": "Targeted sequencing of 469 gastrointestinal stromal tumors and their matched normals via MSK-IMPACT.", "publicStudy": true, "pmid": "36971786", "groups": "", "status": 0, "importDate": "2023-12-07 18:44:10", "allSampleCount": 469, "readPermission": true, "studyId": "gist_msk_2023", "cancerTypeId": "gist", "referenceGenome": "hg19"}.

cannin commented 4 months ago

I think the PMID is wrong. Probably this one: https://pubmed.ncbi.nlm.nih.gov/37477937/ (talk with Ruslan about fixing it; not the highest priority).

XinlingWang0628 commented 4 months ago

Got it, thank you. I am working on testing the PubMed chatbot and its PDF loader.