dedupeio / csvdedupe

Command line tool for deduplicating CSV files

Records do not line up with data model #55

Closed ghost closed 8 years ago

ghost commented 8 years ago
INFO:root:imported 269277 rows from file 1
INFO:root:imported 36467 rows from file 2
INFO:root:using fields: [u'id', u'geo_latitude', u'geo_longitude', u'star_rating_value', u'name', u'city', u'country', u'chain_name', u'type', u'address', u'fax', u'email', u'website']
INFO:root:taking a sample of 15000 possible pairs
Traceback (most recent call last):
  File "/usr/local/bin/csvlink", line 11, in <module>
    sys.exit(launch_new_instance())
  File "/usr/local/lib/python2.7/site-packages/csvdedupe/csvlink.py", line 169, in launch_new_instance
    d.main()
  File "/usr/local/lib/python2.7/site-packages/csvdedupe/csvlink.py", line 119, in main
    deduper.sample(data_1, data_2, self.sample_size)
  File "/Library/Python/2.7/site-packages/dedupe/api.py", line 876, in sample
    self._checkData(data_1, data_2)
  File "/Library/Python/2.7/site-packages/dedupe/api.py", line 910, in _checkData
    self.data_model.check(next(iter(viewvalues(data_2))))
  File "/Library/Python/2.7/site-packages/dedupe/datamodel.py", line 123, in check
    "in a record" % field)
ValueError: Records do not line up with data model. The field 'website' is in data_model but not in a record
fgregg commented 8 years ago

yes?

ghost commented 8 years ago

Why do I get this error?

fgregg commented 8 years ago

Records do not line up with data model. The field 'website' is in data_model but not in a record

ghost commented 8 years ago

I don't understand what this means.

fgregg commented 8 years ago

You said that 'website' was a field that you wanted to compare, but there is no 'website' field in your record.
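
One way to see this kind of mismatch before running csvlink is to compare the fields you want to use against the header row of each file. The following is only a minimal sketch using Python's standard `csv` module; the helper name `missing_fields` is hypothetical and is not part of csvdedupe or dedupe. The header is the one posted for file2.csv in this thread.

```python
import csv
import io

def missing_fields(csv_text, wanted_fields):
    """Return the wanted fields that are absent from the CSV header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = set(reader.fieldnames or [])
    return [field for field in wanted_fields if field not in header]

# Header row of file2.csv as posted in this thread (data rows omitted).
file2 = (
    "geo_latitude,geo_longitude,star_rating_value,name,city,country,"
    "chain_name,type,address,fax\n"
)

# 'email' and 'website' exist only in file1.csv, so they are reported here.
print(missing_fields(file2, ["name", "fax", "email", "website"]))
# → ['email', 'website']
```

Any field reported by a check like this is in the list of fields to compare but not in the records of that file, which is the situation the error message describes.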

ghost commented 8 years ago

I don't understand what I am doing wrong.

csvlink file1.csv file2.csv --config_file config.json

I get

ValueError: Records do not line up with data model. The field 'fax' is in data_model but not in a record

file1.csv

id,geo_latitude,geo_longitude,star_rating_value,name,city,country,chain_name,type,address,fax,postal_code,email,website,booking_phone,management_phone,hotel_phone
107,44.457973,26.091842,,minerva,bucharest,romania,,hotel,street gheorghe manu number 2-4 sector- 1 010445 romania,40213123963,010445,reservation@minerva.ro,www.minerva.ro,0040213181294,0040213122738,+40213111555
108,44.435918,26.094242,,opera,bucharest,romania,,hotel,brezoianu street no 37 sector 1 bucharest romania,0040213124858,010132,info@hotelopera.ro,,,,0040213124857
118,54.595541,-5.933663,3,belfast central travelodge,belfast,united kingdom (great britain),travelodge,hotel,15 brunswick street belfast bt2 7ge united kingdom,441232232999,bt2 7ge,valerie.steinbeck@travelodge.ie,www.travelodge.ie,08701911687,08701911687,00448701911700
...

file2.csv

geo_latitude,geo_longitude,star_rating_value,name,city,country,chain_name,type,address,fax
44.449302,26.091212,4,minerva,bucharest,romania,minerva,hotel,street gheorghe manu number 2-4 sector- 1 010445 romania ,40213123963
44.436976,26.094423,3,opera,bucharest,romania,,hotel,brezoianu street no 37 sector 1 bucharest romania ,40213124011
54.5955,-5.9334,3,travelodge belfast,belfast,united kingdom,,hotel,15 brunswick street belfast bt2 7ge united kingdom ,441232232999
...

config.json

{
  "field_names_1": [
    "id",
    "geo_latitude",
    "geo_longitude",
    "star_rating_value",
    "name",
    "city",
    "country",
    "chain_name",
    "type",
    "address",
    "fax",
    "postal_code",
    "email",
    "website",
    "booking_phone",
    "management_phone",
    "hotel_phone"
  ],
  "field_names_2": [
    "geo_latitude",
    "geo_longitude",
    "star_rating_value",
    "name",
    "city",
    "country",
    "chain_name",
    "type",
    "address",
    "fax"
  ],
  "output_file": "output.csv",
  "skip_training": false,
  "training_file": "training.json",
  "sample_size": 15000,
  "recall_weight": 2
}
fgregg commented 8 years ago

Do all your examples in the training file include the 'fax' field?

ghost commented 8 years ago

Not all examples in the training data include the fax field, but I have added this to config.json:

  "field_definitions" : [
    { "field" : "id", "type" : "String", "Has Missing" : true },
    { "field" : "geo_latitude", "type" : "String", "Has Missing" : true },
    { "field" : "geo_longitude", "type" : "String", "Has Missing" : true },
    { "field" : "star_rating_value", "type" : "String", "Has Missing" : true },
    { "field" : "name", "type" : "String" },
    { "field" : "city", "type" : "String", "Has Missing" : true },
    { "field" : "country", "type" : "String", "Has Missing" : true },
    { "field" : "chain_name", "type" : "String", "Has Missing" : true },
    { "field" : "type", "type" : "String", "Has Missing" : true },
    { "field" : "address", "type" : "String", "Has Missing" : true },
    { "field" : "fax", "type" : "String", "Has Missing" : true },
    { "field" : "postal_code", "type" : "String", "Has Missing" : true },
    { "field" : "email", "type" : "String", "Has Missing" : true },
    { "field" : "website", "type" : "String", "Has Missing" : true },
    { "field" : "booking_phone", "type" : "String", "Has Missing" : true },
    { "field" : "management_phone", "type" : "String", "Has Missing" : true },
    { "field" : "hotel_phone", "type" : "String", "Has Missing" : true }
  ]

Same error: ValueError: Records do not line up with data model. The field 'fax' is in data_model but not in a record

fgregg commented 8 years ago

Every training example has to have the 'fax' field, even if its value is null or empty.
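
As a rough illustration of that fix, the sketch below pads every record in a set of labelled pairs so that each one has a key for every field. It assumes the training data is a dict with "match" and "distinct" keys, each holding a list of two-record pairs stored as plain dicts. That layout is an assumption: files written by some dedupe versions may wrap the pairs differently, so inspect your own training.json first. The helper name `pad_training_pairs` is hypothetical.

```python
import json

def pad_training_pairs(training, fields, filler=None):
    """Give every record in every labelled pair a key for each field."""
    for label in ("match", "distinct"):
        for pair in training.get(label, []):
            for record in pair:
                for field in fields:
                    # Existing values are kept; only absent keys get the filler.
                    record.setdefault(field, filler)
    return training

# Made-up example: the second record has no 'fax' key at all.
training = {
    "match": [[{"name": "minerva", "fax": "40213123963"}, {"name": "minerva"}]],
    "distinct": [],
}

padded = pad_training_pairs(training, ["name", "fax"])
print(json.dumps(padded["match"][0][1]))
# → {"name": "minerva", "fax": null}
```

The filler defaults to `None`, which serializes to JSON `null`; pass `filler=""` if you prefer empty strings.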

ghost commented 8 years ago

ok. thank you

bluesky410 commented 4 years ago

fields = [
    {'field' : 'Region', 'type': 'String'},
    {'field' : 'Country', 'type': 'String'},
    {'field' : 'Item_Type', 'type': 'String'},
    {'field' : 'Sales_Channel', 'type': 'String'},
    {'field' : 'Order_Date', 'type': 'String', 'has missing' : True},
]
deduper = dedupe.Dedupe(fields)
...

result:

WARNING:dedupe.backport:Dedupe does not currently support multiprocessing on Windows
...
ValueError: Records do not line up with data model. The field 'Region' is in data_model but not in a record

What does this error mean? I want to debug it. Please help.

anandhu1436 commented 4 years ago

I have the same issue even though the record contains the field.

teramike commented 1 year ago

If you changed your dataset slightly (like adding new fields), it seems you have to delete your previous training.json file or adapt it to the new fields. Thought I'd share just in case!
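
To decide whether an old training file is stale, one option is to list the fields that its records lack. This is only a sketch: it assumes the loaded training data maps labels to lists of record pairs stored as plain dicts, which may not match the file your dedupe version writes, and the function name `fields_missing_from_training` is hypothetical.

```python
def fields_missing_from_training(training, fields):
    """Return the sorted field names that at least one training record lacks."""
    missing = set()
    for pairs in training.values():
        for pair in pairs:
            for record in pair:
                missing.update(f for f in fields if f not in record)
    return sorted(missing)

# Made-up training data labelled before 'website' was added to the dataset.
training = {
    "match": [[{"name": "a"}, {"name": "b"}]],
    "distinct": [[{"name": "a", "fax": ""}, {"name": "c", "fax": ""}]],
}

print(fields_missing_from_training(training, ["name", "fax", "website"]))
# → ['fax', 'website']
```

An empty result suggests the file still lines up with the current fields; anything else means the file needs to be deleted or adapted, as described above.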