biocypher / open-targets

Missing header on neo4j-admin import #6

Open jma1991 opened 1 year ago

jma1991 commented 1 year ago

Issue

I encountered the following error when I tried to run the neo4j-admin-import-call.sh script using the latest commit d18348b:

org.neo4j.internal.batchimport.input.HeaderException: Missing header of type START_ID, among entries [:START_ID\t:END_ID\t:TYPE]
        at org.neo4j.internal.batchimport.input.csv.DataFactories$AbstractDefaultFileHeaderParser.validateHeader(DataFactories.java:293)
        at org.neo4j.internal.batchimport.input.csv.DataFactories$AbstractDefaultFileHeaderParser.create(DataFactories.java:254)
        at org.neo4j.internal.batchimport.input.csv.CsvInput.verifyHeaders(CsvInput.java:159)
        at org.neo4j.internal.batchimport.input.csv.CsvInput.<init>(CsvInput.java:120)
        at org.neo4j.internal.batchimport.input.csv.CsvInput.<init>(CsvInput.java:98)
        at org.neo4j.importer.CsvImporter.doImport(CsvImporter.java:168)
        at org.neo4j.importer.ImportCommand.execute(ImportCommand.java:268)
        at org.neo4j.cli.AbstractCommand.call(AbstractCommand.java:71)
        at org.neo4j.cli.AbstractCommand.call(AbstractCommand.java:34)
        at picocli.CommandLine.executeUserObject(CommandLine.java:1953)
        at picocli.CommandLine.access$1300(CommandLine.java:145)
        at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2352)
        at picocli.CommandLine$RunLast.handle(CommandLine.java:2346)
        at picocli.CommandLine$RunLast.handle(CommandLine.java:2311)
        at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
        at picocli.CommandLine.execute(CommandLine.java:2078)
        at org.neo4j.cli.AdminTool.execute(AdminTool.java:93)
        at org.neo4j.cli.AdminTool.main(AdminTool.java:79)

The neo4j-admin-import-call.sh script is copied here:

bin/neo4j-admin import --database=test --delimiter="\t" --array-delimiter="|" --quote="'" --force=true --skip-bad-relationships=true --nodes="/Users/James/GitHub/open-targets/biocypher-out/20230710114924/HumanGene-header.csv,/Users/James/GitHub/open-targets/biocypher-out/20230710114924/HumanGene-part.*" --nodes="/Users/James/GitHub/open-targets/biocypher-out/20230710114924/Efo.Disease-header.csv,/Users/James/GitHub/open-targets/biocypher-out/20230710114924/Efo.Disease-part.*" --nodes="/Users/James/GitHub/open-targets/biocypher-out/20230710114924/GoTerm-header.csv,/Users/James/GitHub/open-targets/biocypher-out/20230710114924/GoTerm-part.*" --nodes="/Users/James/GitHub/open-targets/biocypher-out/20230710114924/Hp.Disease-header.csv,/Users/James/GitHub/open-targets/biocypher-out/20230710114924/Hp.Disease-part.*" --nodes="/Users/James/GitHub/open-targets/biocypher-out/20230710114924/Mondo.Disease-header.csv,/Users/James/GitHub/open-targets/biocypher-out/20230710114924/Mondo.Disease-part.*" --nodes="/Users/James/GitHub/open-targets/biocypher-out/20230710114924/MousePhenotype-header.csv,/Users/James/GitHub/open-targets/biocypher-out/20230710114924/MousePhenotype-part.*" --nodes="/Users/James/GitHub/open-targets/biocypher-out/20230710114924/MouseGene-header.csv,/Users/James/GitHub/open-targets/biocypher-out/20230710114924/MouseGene-part.*" --nodes="/Users/James/GitHub/open-targets/biocypher-out/20230710114924/Literature.GeneToDiseaseAssociation-header.csv,/Users/James/GitHub/open-targets/biocypher-out/20230710114924/Literature.GeneToDiseaseAssociation-part.*" --nodes="/Users/James/GitHub/open-targets/biocypher-out/20230710114924/GeneticAssociation.GeneToDiseaseAssociation-header.csv,/Users/James/GitHub/open-targets/biocypher-out/20230710114924/GeneticAssociation.GeneToDiseaseAssociation-part.*" 
--nodes="/Users/James/GitHub/open-targets/biocypher-out/20230710114924/AnimalModel.GeneToDiseaseAssociation-header.csv,/Users/James/GitHub/open-targets/biocypher-out/20230710114924/AnimalModel.GeneToDiseaseAssociation-part.*" --nodes="/Users/James/GitHub/open-targets/biocypher-out/20230710114924/KnownDrug.GeneToDiseaseAssociation-header.csv,/Users/James/GitHub/open-targets/biocypher-out/20230710114924/KnownDrug.GeneToDiseaseAssociation-part.*" --nodes="/Users/James/GitHub/open-targets/biocypher-out/20230710114924/RnaExpression.GeneToDiseaseAssociation-header.csv,/Users/James/GitHub/open-targets/biocypher-out/20230710114924/RnaExpression.GeneToDiseaseAssociation-part.*" --nodes="/Users/James/GitHub/open-targets/biocypher-out/20230710114924/SomaticMutation.GeneToDiseaseAssociation-header.csv,/Users/James/GitHub/open-targets/biocypher-out/20230710114924/SomaticMutation.GeneToDiseaseAssociation-part.*" --nodes="/Users/James/GitHub/open-targets/biocypher-out/20230710114924/AffectedPathway.GeneToDiseaseAssociation-header.csv,/Users/James/GitHub/open-targets/biocypher-out/20230710114924/AffectedPathway.GeneToDiseaseAssociation-part.*" --relationships="/Users/James/GitHub/open-targets/biocypher-out/20230710114924/IS_PART_OF-header.csv,/Users/James/GitHub/open-targets/biocypher-out/20230710114924/IS_PART_OF-part.*"
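The error lists `[:START_ID\t:END_ID\t:TYPE]` as a single entry, so one thing that may be worth checking is whether the header line is actually being split on the delimiter. The sketch below is only a diagnostic idea, not part of BioCypher or Neo4j: `describe_header` is a hypothetical helper that reports how a header line splits on a real tab character versus the two-character sequence backslash + `t`.

```python
def describe_header(header_text: str) -> dict:
    """Show how the first line of a header file splits under two delimiter readings.

    Hypothetical diagnostic helper; it only inspects text and makes no
    claim about how neo4j-admin itself parses the file.
    """
    lines = header_text.splitlines()
    line = lines[0] if lines else ""
    return {
        # Fields if the file contains real tab characters (0x09).
        "real_tab_fields": line.split("\t"),
        # Fields if the file contains a literal backslash followed by "t".
        "literal_backslash_t_fields": line.split("\\t"),
        "contains_real_tab": "\t" in line,
        "contains_literal_backslash_t": "\\t" in line,
    }
```

To use it, read a header file such as IS_PART_OF-header.csv as text and pass the contents to `describe_header`. If `contains_real_tab` is false, the header and the `--delimiter` argument may not agree on what the separator is.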

Hardware

Computer: MacBook Air (M1, 2020)
Chip: Apple M1
Memory: 16 GB
macOS: Ventura 13.4.1

Dependencies

I installed all dependencies using the conda package manager:

  - python=3.10
  - snakeviz=2.1.1
  - pyspark=3.3.1
  - pandas=2.0.1
  - pip:
    - bioregistry==0.6.45
    - git+https://github.com/tangentlabs/django-oscar-paypal.git
    - git+https://github.com/saezlab/DepMap-BioCypher.git
    - biocypher==0.5.4

jma1991 commented 1 year ago

After some more exploration, this might be related to #5 as I've spotted the quotation character in some other node files:

The GoTerm-part000.csv has a 3',5' string in one of the entries:

go:0097657\t'3',5'-nucleotide bisphosphate phosphatase activity'\t'Open Targets'\t'https://platform-docs.opentargets.org/licence'\t'22.11'\t'go:0097657'\t'go'\tBiologicalEntity|Entity|GoTerm|NamedThing

Likewise, the Efo.Disease-part000.csv has a Crohn's string in one of the entries:

efo:0005622\t'http://www.ebi.ac.uk/efo/EFO_0005622'\t'Crohn's colitis'\t'Crohn's disease affecting the colon.'\t\t'Open Targets'\t'https://platform-docs.opentargets.org/licence'\t'22.11'\t'efo:0005622'\t'efo'\tBiologicalEntity|Disease|DiseaseOrPhenotypicFeature|Efo.Disease|Entity|NamedThing
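To find other affected rows, a small scan along these lines might help. This is a rough sketch under simplifying assumptions: `fields_with_embedded_quote` is a hypothetical helper, it splits naively on the delimiter (so it does not handle a delimiter inside a quoted field), and the delimiter and quote character are parameters because they depend on how the files were written.

```python
def fields_with_embedded_quote(line, delimiter="\t", quote="'"):
    """Return (index, field) pairs whose value contains the quote character.

    For fields wrapped in the quote character, the enclosing pair is
    ignored and only quotes inside the value are reported.
    """
    hits = []
    for index, field in enumerate(line.rstrip("\n").split(delimiter)):
        wrapped = (
            len(field) >= 2
            and field.startswith(quote)
            and field.endswith(quote)
        )
        inner = field[1:-1] if wrapped else field
        if quote in inner:
            hits.append((index, field))
    return hits
```

Running this over every line of a part file and printing the non-empty results gives a list of fields that may trip up the importer when `--quote="'"` is used.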

slobentanzer commented 1 year ago

Hi @jma1991, thanks for the report and the thorough description!

Funny error indeed, I think I have not yet encountered that one. If it really is connected to the quotes, I am not at all sure how we would get a 'missing header' from that. But I can confirm that the quotes are a constant source of trouble. I think I may have to add a parameter to the otar package that replaces quotes depending on which quotes the user chooses for the Neo4j files.

Can you tell me which version of the otar-biocypher package you are using?

SheliO commented 7 months ago

I have also encountered a similar issue with the quotes (with the Disease-part000.csv). I am trying to import the graph into Neo4j via neo4j-admin and get:

IMPORT FAILED in 550ms.
Data statistics is not available.
Peak memory usage: 0B
Error in input data
Caused by:ERROR in input data
  source: BufferedCharSeeker[source:C:\Users\Shelig.Neo4jDesktop\relate-data\dbmss\dbms-e1730ac4-bd71-491e-aac4-3ba74af56ebb\import\Efo.Disease-part000.csv, position:3195, line:6]
  in field: description:string:4
  for header: [:ID, code:string, name:string, description:string, ontology:string, source:string, licence:string, version:string, id:string, preferred_id:string, :LABEL]
  raw field value: dysembryoplastic neuroepithelial tumor
  original error: At C:\Users\Shelig.Neo4jDesktop\relate-data\dbmss\dbms-e1730ac4-bd71-491e-aac4-3ba74af56ebb\import\Efo.Disease-part000.csv @ position 3195 - there's a field starting with a quote and whereas it ends that quote there seems to be characters in that field after that ending quote. That isn't supported. This is what I read: 'A benign glial-neuronal neoplasm. It is usually supratentorial, located, generally, in the cortex and occurs in children and young adults with a long-standing history of partial seizures. A histologic hallmark of this tumor is the 's'

slobentanzer commented 7 months ago

Hi @SheliO, thanks for the report. Unfortunately, quotes are really tricky to handle in a general manner, as people may have different preferences, and sometimes the operating system and Java also behave differently. Short-term solutions I tend to use are:

Implementing any or all of these in the framework is a design decision, and I don't feel I have enough experience with the problem and community use to propose a default solution, so keeping it in the adapter would be most consistent with the BioCypher design philosophy.
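As an illustration of what handling this in the adapter could look like, here is a minimal sketch. It assumes that node and edge properties are available as a plain dict before they are handed to BioCypher; `sanitize_quotes` and its parameters are hypothetical names and not part of the BioCypher or otar-biocypher API.

```python
def sanitize_quotes(properties, quote="'", replacement="`"):
    """Return a copy of a property dict with the quote character replaced.

    Hypothetical adapter-side helper: string values, and strings inside
    list values, have `quote` swapped for `replacement`; everything else
    is passed through unchanged.
    """
    def clean(value):
        if isinstance(value, str):
            return value.replace(quote, replacement)
        if isinstance(value, list):
            return [clean(item) for item in value]
        return value

    return {key: clean(value) for key, value in properties.items()}
```

Which replacement character is acceptable depends on the data and on the quote character chosen for the Neo4j files, which is why both are left as parameters here.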

What do you think, which way forward would you prefer?

SheliO commented 7 months ago

Thank you @slobentanzer for your quick and detailed response. Because our goal is to upload the data onto a remote Neo4j instance, we switched our effort from neo4j-admin to a Python implementation that calls APOC procedures. If you have any insights regarding this effort, they would be much appreciated. Thanks

slobentanzer commented 7 months ago

Hi @SheliO, this sounds super interesting; we have the driver module that uses the Python driver (via neo4j_utils). It would be great to compare, since we have not put much work into that recently, and I would be interested to hear if you do something differently.

We found initially that the driver is a bit slow for larger datasets, which is why we defaulted to the neo4j-admin procedure so far.
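For comparison purposes, a common way to reduce per-statement overhead with the official Python driver is to send rows in batches and expand them with `UNWIND`. The following is a minimal sketch of that general pattern, not the implementation of the BioCypher driver module and not an APOC-based approach. `load_nodes`, the row shape (`{"id": ..., "props": {...}}`), and the batch size are assumptions made for illustration.

```python
from itertools import islice


def batched(rows, size):
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def load_nodes(uri, user, password, label, rows, batch_size=10_000):
    """Merge nodes batch by batch; each row is {"id": ..., "props": {...}}.

    Illustrative only: requires the third-party `neo4j` package and a
    reachable database.
    """
    from neo4j import GraphDatabase  # imported lazily; pip install neo4j

    # Labels cannot be passed as Cypher parameters, so the label is
    # interpolated into the query text; only use trusted label names.
    query = (
        "UNWIND $rows AS row "
        f"MERGE (n:`{label}` {{id: row.id}}) "
        "SET n += row.props"
    )
    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        with driver.session() as session:
            for batch in batched(rows, batch_size):
                # consume() waits for the statement to finish before
                # the next batch is sent.
                session.run(query, rows=batch).consume()
```

Because the values travel as query parameters rather than through CSV files, the quote character in strings such as `Crohn's colitis` does not need any escaping on this path.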

slobentanzer commented 1 week ago

Hi @SheliO, are you able to share any experience you have had with the driver connection, or maybe point us to a public repo? Otherwise, feel free to close this issue.