Missing header on neo4j-admin import

jma1991 commented 1 year ago

Issue

I encountered the following error when running the neo4j-admin-import-call.sh script at the latest commit, d18348b:

org.neo4j.internal.batchimport.input.HeaderException: Missing header of type START_ID, among entries [:START_ID\t:END_ID\t:TYPE]
        at org.neo4j.internal.batchimport.input.csv.DataFactories$AbstractDefaultFileHeaderParser.validateHeader(DataFactories.java:293)
        at org.neo4j.internal.batchimport.input.csv.DataFactories$AbstractDefaultFileHeaderParser.create(DataFactories.java:254)
        at org.neo4j.internal.batchimport.input.csv.CsvInput.verifyHeaders(CsvInput.java:159)
        at org.neo4j.internal.batchimport.input.csv.CsvInput.<init>(CsvInput.java:120)
        at org.neo4j.internal.batchimport.input.csv.CsvInput.<init>(CsvInput.java:98)
        at org.neo4j.importer.CsvImporter.doImport(CsvImporter.java:168)
        at org.neo4j.importer.ImportCommand.execute(ImportCommand.java:268)
        at org.neo4j.cli.AbstractCommand.call(AbstractCommand.java:71)
        at org.neo4j.cli.AbstractCommand.call(AbstractCommand.java:34)
        at picocli.CommandLine.executeUserObject(CommandLine.java:1953)
        at picocli.CommandLine.access$1300(CommandLine.java:145)
        at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2352)
        at picocli.CommandLine$RunLast.handle(CommandLine.java:2346)
        at picocli.CommandLine$RunLast.handle(CommandLine.java:2311)
        at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
        at picocli.CommandLine.execute(CommandLine.java:2078)
        at org.neo4j.cli.AdminTool.execute(AdminTool.java:93)
        at org.neo4j.cli.AdminTool.main(AdminTool.java:79)
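
One thing that may be worth checking: the exception lists `[:START_ID\t:END_ID\t:TYPE]` as what looks like a single entry, which could mean the header line was not split on the delimiter. The sketch below is only a diagnostic aid under that assumption. `split_header` is a hypothetical helper, not part of BioCypher or Neo4j. It splits a header line on a real tab character and reports whether the line contains the two-character sequence backslash + `t` instead.

```python
def split_header(header_line, delimiter="\t"):
    """Split one header line and flag a literal backslash-t sequence.

    Hypothetical helper for illustration; not part of BioCypher or Neo4j.
    """
    line = header_line.rstrip("\r\n")
    fields = line.split(delimiter)
    # A backslash followed by "t" is two characters, not a tab.
    has_literal_escape = "\\t" in line
    return fields, has_literal_escape


if __name__ == "__main__":
    # Header written with real tab characters: three separate fields.
    print(split_header(":START_ID\t:END_ID\t:TYPE"))
    # Header written with backslash + t: a single unsplit field.
    print(split_header(r":START_ID\t:END_ID\t:TYPE"))
```

If the second case matches what is in IS_PART_OF-header.csv, the delimiter written to the file and the delimiter passed to neo4j-admin may disagree; this is a guess from the error text, not a confirmed diagnosis.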

The neo4j-admin-import-call.sh script is copied here:

bin/neo4j-admin import --database=test --delimiter="\t" --array-delimiter="|" --quote="'" --force=true --skip-bad-relationships=true \
  --nodes="/Users/James/GitHub/open-targets/biocypher-out/20230710114924/HumanGene-header.csv,/Users/James/GitHub/open-targets/biocypher-out/20230710114924/HumanGene-part.*" \
  --nodes="/Users/James/GitHub/open-targets/biocypher-out/20230710114924/Efo.Disease-header.csv,/Users/James/GitHub/open-targets/biocypher-out/20230710114924/Efo.Disease-part.*" \
  --nodes="/Users/James/GitHub/open-targets/biocypher-out/20230710114924/GoTerm-header.csv,/Users/James/GitHub/open-targets/biocypher-out/20230710114924/GoTerm-part.*" \
  --nodes="/Users/James/GitHub/open-targets/biocypher-out/20230710114924/Hp.Disease-header.csv,/Users/James/GitHub/open-targets/biocypher-out/20230710114924/Hp.Disease-part.*" \
  --nodes="/Users/James/GitHub/open-targets/biocypher-out/20230710114924/Mondo.Disease-header.csv,/Users/James/GitHub/open-targets/biocypher-out/20230710114924/Mondo.Disease-part.*" \
  --nodes="/Users/James/GitHub/open-targets/biocypher-out/20230710114924/MousePhenotype-header.csv,/Users/James/GitHub/open-targets/biocypher-out/20230710114924/MousePhenotype-part.*" \
  --nodes="/Users/James/GitHub/open-targets/biocypher-out/20230710114924/MouseGene-header.csv,/Users/James/GitHub/open-targets/biocypher-out/20230710114924/MouseGene-part.*" \
  --nodes="/Users/James/GitHub/open-targets/biocypher-out/20230710114924/Literature.GeneToDiseaseAssociation-header.csv,/Users/James/GitHub/open-targets/biocypher-out/20230710114924/Literature.GeneToDiseaseAssociation-part.*" \
  --nodes="/Users/James/GitHub/open-targets/biocypher-out/20230710114924/GeneticAssociation.GeneToDiseaseAssociation-header.csv,/Users/James/GitHub/open-targets/biocypher-out/20230710114924/GeneticAssociation.GeneToDiseaseAssociation-part.*" \
  --nodes="/Users/James/GitHub/open-targets/biocypher-out/20230710114924/AnimalModel.GeneToDiseaseAssociation-header.csv,/Users/James/GitHub/open-targets/biocypher-out/20230710114924/AnimalModel.GeneToDiseaseAssociation-part.*" \
  --nodes="/Users/James/GitHub/open-targets/biocypher-out/20230710114924/KnownDrug.GeneToDiseaseAssociation-header.csv,/Users/James/GitHub/open-targets/biocypher-out/20230710114924/KnownDrug.GeneToDiseaseAssociation-part.*" \
  --nodes="/Users/James/GitHub/open-targets/biocypher-out/20230710114924/RnaExpression.GeneToDiseaseAssociation-header.csv,/Users/James/GitHub/open-targets/biocypher-out/20230710114924/RnaExpression.GeneToDiseaseAssociation-part.*" \
  --nodes="/Users/James/GitHub/open-targets/biocypher-out/20230710114924/SomaticMutation.GeneToDiseaseAssociation-header.csv,/Users/James/GitHub/open-targets/biocypher-out/20230710114924/SomaticMutation.GeneToDiseaseAssociation-part.*" \
  --nodes="/Users/James/GitHub/open-targets/biocypher-out/20230710114924/AffectedPathway.GeneToDiseaseAssociation-header.csv,/Users/James/GitHub/open-targets/biocypher-out/20230710114924/AffectedPathway.GeneToDiseaseAssociation-part.*" \
  --relationships="/Users/James/GitHub/open-targets/biocypher-out/20230710114924/IS_PART_OF-header.csv,/Users/James/GitHub/open-targets/biocypher-out/20230710114924/IS_PART_OF-part.*"

Hardware

Computer: MacBook Air (M1, 2020)
Chip: Apple M1
Memory: 16 GB
macOS: Ventura 13.4.1

Dependencies

I installed all dependencies using the conda package manager:

  - python=3.10
  - snakeviz=2.1.1
  - pyspark=3.3.1
  - pandas=2.0.1
  - pip:
    - bioregistry==0.6.45
    - git+https://github.com/tangentlabs/django-oscar-paypal.git
    - git+https://github.com/saezlab/DepMap-BioCypher.git
    - biocypher==0.5.4

jma1991 commented 1 year ago

After some more exploration, this might be related to #5 as I've spotted the quotation character in some other node files:

The GoTerm-part000.csv has a 3',5' string in one of the entries:

go:0097657\t'3',5'-nucleotide bisphosphate phosphatase activity'\t'Open Targets'\t'https://platform-docs.opentargets.org/licence'\t'22.11'\t'go:0097657'\t'go'\tBiologicalEntity|Entity|GoTerm|NamedThing

Likewise, the Efo.Disease-part000.csv has a Crohn's string in one of the entries:

efo:0005622\t'http://www.ebi.ac.uk/efo/EFO_0005622'\t'Crohn's colitis'\t'Crohn's disease affecting the colon.'\t\t'Open Targets'\t'https://platform-docs.opentargets.org/licence'\t'22.11'\t'efo:0005622'\t'efo'\tBiologicalEntity|Disease|DiseaseOrPhenotypicFeature|Efo.Disease|Entity|NamedThing
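
A rough way to find such rows before running the import is to scan each part file for quoted fields that still contain the quote character inside. The sketch below makes several assumptions: fields are separated by real tab characters, the quote character is `'`, no delimiter occurs inside a quoted field, and a doubled quote counts as an escape (the usual CSV convention). `fields_with_stray_quotes` is a hypothetical helper name, not something from BioCypher.

```python
def fields_with_stray_quotes(line, delimiter="\t", quote="'"):
    """Return quoted fields that contain an unescaped quote character inside.

    Hypothetical helper; assumes no delimiter occurs inside a quoted field.
    """
    flagged = []
    for field in line.rstrip("\r\n").split(delimiter):
        is_quoted = (
            len(field) >= 2 and field.startswith(quote) and field.endswith(quote)
        )
        if not is_quoted:
            continue
        inner = field[1:-1]
        # A doubled quote is the usual CSV escape, so ignore those pairs.
        if quote in inner.replace(quote * 2, ""):
            flagged.append(field)
    return flagged


if __name__ == "__main__":
    row = (
        "go:0097657\t'3',5'-nucleotide bisphosphate phosphatase activity'"
        "\t'Open Targets'\t'22.11'\tBiologicalEntity|Entity|GoTerm|NamedThing"
    )
    for field in fields_with_stray_quotes(row):
        print("suspicious field:", field)
```

Running this over every line of a part file (for example with `open(path, encoding="utf-8")`) would list the fields that are likely to trip the importer.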

slobentanzer commented 1 year ago

Hi @jma1991, thanks for the report and the thorough description!

Funny error indeed, I think I have not yet encountered that one. If it really is connected to the quotes, I am not at all sure how we would get a 'missing header' from that. But I can confirm that the quotes are a constant source of trouble. I think I may have to add a parameter to the otar package that replaces quotes depending on which quotes the user chooses for the Neo4j files.

Can you tell me which version of the otar-biocypher package you are using?

SheliO commented 7 months ago

I have also encountered a similar issue with the quotes (with the Disease-part000.csv). I am trying to import the graph into Neo4j via neo4j-admin and get:

IMPORT FAILED in 550ms.
Data statistics is not available.
Peak memory usage: 0B
Error in input data
Caused by:ERROR in input data
  source: BufferedCharSeeker[source:C:\Users\Shelig.Neo4jDesktop\relate-data\dbmss\dbms-e1730ac4-bd71-491e-aac4-3ba74af56ebb\import\Efo.Disease-part000.csv, position:3195, line:6]
  in field: description:string:4
  for header: [:ID, code:string, name:string, description:string, ontology:string, source:string, licence:string, version:string, id:string, preferred_id:string, :LABEL]
  raw field value: dysembryoplastic neuroepithelial tumor
  original error: At C:\Users\Shelig.Neo4jDesktop\relate-data\dbmss\dbms-e1730ac4-bd71-491e-aac4-3ba74af56ebb\import\Efo.Disease-part000.csv @ position 3195 - there's a field starting with a quote and whereas it ends that quote there seems to be characters in that field after that ending quote. That isn't supported. This is what I read: 'A benign glial-neuronal neoplasm. It is usually supratentorial, located, generally, in the cortex and occurs in children and young adults with a long-standing history of partial seizures. A histologic hallmark of this tumor is the 's'

slobentanzer commented 7 months ago

Hi @SheliO, thanks for the report. Unfortunately, quotes are really tricky to handle in a general manner, as people may have different preferences, and sometimes the operating system and Java also behave differently. Short-term solutions I tend to use are:

- Use a quote character in Neo4j that is very unusual, such as the "broken bar", ¦ (Unicode U+00A6; code 221 in code page 850, not part of 7-bit ASCII). The problem with this is that you have to be sure the character you use does not occur in the text content, and sometimes the character is not accepted by Neo4j (I think because of Java).
- Use a character replace function in the adapter, changing the quote character (e.g., single quote) to something else or removing it. This of course changes the content of the fields, which may not be desired, and also takes more time when building the DB.
- Convert text fields to base64 in the adapter, and then run a function in Neo4j that goes through the entire database and converts the text back to regular strings. This is quite involved and requires post-processing in the database after creation, so it is also not ideal, although it reliably solves the quote issue and does not change the text content.
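
For illustration, the replace and base64 options could look roughly like this on the adapter side. This is a minimal sketch, not existing BioCypher or otar-biocypher code: `replace_quote`, `encode_field`, and `decode_field` are hypothetical names, and the backtick default is an arbitrary choice.

```python
import base64


def replace_quote(text, quote="'", replacement="`"):
    """Swap the import quote character for another one (changes the content)."""
    return text.replace(quote, replacement)


def encode_field(text):
    """Encode a text field as base64 so it contains no quotes or tabs."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def decode_field(encoded):
    """Reverse encode_field, recovering the original text unchanged."""
    return base64.b64decode(encoded).decode("utf-8")


if __name__ == "__main__":
    name = "Crohn's colitis"
    print(replace_quote(name))
    encoded = encode_field(name)
    print(encoded, "->", decode_field(encoded))
```

The standard base64 alphabet only uses letters, digits, `+`, `/`, and `=`, so an encoded field cannot clash with `'`, a tab, or `|`. The decoding step inside the database is not shown here; APOC's `apoc.text.base64Decode` function may be one way to do it, but that is an assumption and has not been tested against this dataset.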

Implementing any or all of these in the framework is a design decision, and I don't feel I have enough experience with the problem and community use to propose a default solution, so keeping it in the adapter would be most consistent with the BioCypher design philosophy.

What do you think, which way forward would you prefer?

SheliO commented 7 months ago

Thank you @slobentanzer for your quick and detailed response. Because our goal is to upload the data onto a remote Neo4j instance, we switched the effort from neo4j-admin to a Python implementation that calls APOC procedures. Any insights regarding this effort would be much appreciated. Thanks

slobentanzer commented 7 months ago

Hi @SheliO, this sounds super interesting; we have the driver module that uses the Python driver (via neo4j_utils). It would be great to compare, since we have not put much work into that recently; I would be interested to hear if you do something differently.

We found initially that the driver is a bit slow for larger datasets, which is why we have defaulted to the neo4j-admin procedure so far.
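
One common way to reduce per-row overhead when loading through the driver is to send rows in batches with `UNWIND`. The sketch below is an assumption about how a Python plus APOC loader might be structured, not a description of either implementation mentioned in this thread. `load_nodes` and `batched` are hypothetical helpers, the row shape (`labels`, `props`) is made up for the example, and `session` is expected to behave like a session from the official `neo4j` Python driver (`run(query, **params)` returning a result with `single()`).

```python
from itertools import islice

# Creates one node per row via APOC; labels and properties come from each row.
CREATE_NODES = (
    "UNWIND $rows AS row "
    "CALL apoc.create.node(row.labels, row.props) YIELD node "
    "RETURN count(node) AS created"
)


def batched(rows, size):
    """Yield successive lists of at most `size` rows."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def load_nodes(session, rows, batch_size=1000):
    """Send rows to the database in batches; return the number of nodes created."""
    total = 0
    for batch in batched(rows, batch_size):
        result = session.run(CREATE_NODES, rows=batch)
        total += result.single()["created"]
    return total
```

With the real driver, the session would come from something like `GraphDatabase.driver(uri, auth=(user, password)).session()`, where the URI and credentials are placeholders. Because values travel as query parameters rather than CSV text, the quote character needs no escaping on this path.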

slobentanzer commented 1 week ago

Hi @SheliO, are you able to share any experiences you had with the driver connection, or maybe point us to a public repo? Otherwise, feel free to close this issue.

biocypher / open-targets