WRI-Cities / static-GTFS-manager

GUI for creating, editing, and exporting static GTFS data for a public transit authority
GNU General Public License v3.0

Cannot upload existing GTFS feed #82

Closed laidig closed 6 years ago

laidig commented 6 years ago

I was trying to upload an existing feed and ran into the error below. The data ends up containing numpy int64 values, which are not directly serializable to JSON.

A fix appears to be here. https://stackoverflow.com/questions/27050108/convert-numpy-type-to-python/27050186#27050186
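The approach from that Stack Overflow answer can be sketched as a custom encoder that converts numpy scalars to native Python values before serialization (illustrative only, not the project's actual code):

```python
import json
import numpy as np

class NumpyEncoder(json.JSONEncoder):
    """JSON encoder that converts numpy scalars/arrays to native Python."""
    def default(self, obj):
        # np.generic covers numpy scalar types such as int64;
        # .item() returns the equivalent native Python value.
        if isinstance(obj, np.generic):
            return obj.item()
        if isinstance(obj, np.ndarray):
            return obj.tolist()
        return super().default(obj)

row = {"stop_sequence": np.int64(3), "shape_pt_lat": np.float64(12.97)}
print(json.dumps(row, cls=NumpyEncoder))
# {"stop_sequence": 3, "shape_pt_lat": 12.97}
```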

static-gtfs-manager_1  | Traceback (most recent call last):
static-gtfs-manager_1  |   File "/root/.local/lib/python3.6/site-packages/tornado/web.py", line 1541, in _execute
static-gtfs-manager_1  |     result = method(*self.path_args, **self.path_kwargs)
static-gtfs-manager_1  |   File "GTFSManager.py", line 605, in post
static-gtfs-manager_1  |     importGTFS(dbfile, zipname)
static-gtfs-manager_1  |   File "<string>", line 110, in importGTFS
static-gtfs-manager_1  |   File "/root/.local/lib/python3.6/site-packages/tinydb/database.py", line 435, in insert_multiple
static-gtfs-manager_1  |     self._write(data)
static-gtfs-manager_1  |   File "/root/.local/lib/python3.6/site-packages/tinydb/database.py", line 370, in _write
static-gtfs-manager_1  |     self._storage.write(values)
static-gtfs-manager_1  |   File "/root/.local/lib/python3.6/site-packages/tinydb/database.py", line 107, in write
static-gtfs-manager_1  |     self._storage.write(raw_data)
static-gtfs-manager_1  |   File "/root/.local/lib/python3.6/site-packages/tinydb/storages.py", line 110, in write
static-gtfs-manager_1  |     serialized = json.dumps(data, **self.kwargs)
static-gtfs-manager_1  |   File "/usr/local/lib/python3.6/json/__init__.py", line 238, in dumps
static-gtfs-manager_1  |     **kw).encode(obj)
static-gtfs-manager_1  |   File "/usr/local/lib/python3.6/json/encoder.py", line 201, in encode
static-gtfs-manager_1  |     chunks = list(chunks)
static-gtfs-manager_1  |   File "/usr/local/lib/python3.6/json/encoder.py", line 430, in _iterencode
static-gtfs-manager_1  |     yield from _iterencode_dict(o, _current_indent_level)
static-gtfs-manager_1  |   File "/usr/local/lib/python3.6/json/encoder.py", line 404, in _iterencode_dict
static-gtfs-manager_1  |     yield from chunks
static-gtfs-manager_1  |   File "/usr/local/lib/python3.6/json/encoder.py", line 404, in _iterencode_dict
static-gtfs-manager_1  |     yield from chunks
static-gtfs-manager_1  |   File "/usr/local/lib/python3.6/json/encoder.py", line 404, in _iterencode_dict
static-gtfs-manager_1  |     yield from chunks
static-gtfs-manager_1  |   File "/usr/local/lib/python3.6/json/encoder.py", line 437, in _iterencode
static-gtfs-manager_1  |     o = _default(o)
static-gtfs-manager_1  |   File "/usr/local/lib/python3.6/json/encoder.py", line 180, in default
static-gtfs-manager_1  |     o.__class__.__name__)
static-gtfs-manager_1  | TypeError: Object of type 'int64' is not JSON serializable
answerquest commented 6 years ago

Hi, I'm developing a different way of storing the data, in HDF5 format. Along the way I've made many other changes too, like tightly type-casting all the defined GTFS fields as either string or number, so there will be changes soon. At present I've changed some of the functions, but many others are breaking and need work, so I haven't pushed the changed code to GitHub yet (how does one push a folder from the Ubuntu terminal as a development branch?). In the meantime, can you share the column and the values that triggered this error?

Here's the type-casting:

GTFS_dtypes= {
    'route_id':'str', 'route_short_name':'str', 'route_long_name':'str', 'route_text_color':'str', 'route_color' :'str', #routes
    'stop_id':'str', 'zone_id':'str', 'stop_name':'str', #stops
    'trip_id' :'str', 'block_id':'str', 'trip_headsign':'str', 'direction_id':'int64', #trips
    'trans_id':'str', 'translation':'str', #translations
    'agency_id':'str', 'agency_name':'str', 'agency_timezone':'str', 'agency_url':'str', #agency
    'shape_id':'str', 'shape_pt_lat':'float64', 'shape_pt_lon':'float64', 'shape_pt_sequence':'int64', #shapes
    'arrival_time':'str', 'departure_time':'str', 'stop_sequence':'int64', 'stop_headsign':'str', 'timepoint':'int64', 'pickup_type':'int64', 'drop_off_type':'int64', #stop_times
    'service_id':'str', 'monday':'int64', 'tuesday':'int64', 'wednesday':'int64', 'thursday':'int64','friday':'int64', 'saturday':'int64', 'sunday':'int64', 'start_date':'str', 'end_date':'str', #calendar
    'fare_id':'str', 'price':'float64', 'currency_type':'str', 'payment_method':'int64', 'transfers':'int64', #fare-attributes
    'origin_id':'str', 'destination_id':'str', 'contains_id':'str' #fare-rules
}
answerquest commented 6 years ago

@laidig here are the terminal errors at my end. I'll share the code snippet too. I think we have to find where the stop_times.txt file is causing the error.

df = pd.read_csv(unzipFolder + txtfile , na_filter=False, dtype=GTFS_dtypes)

Terminal:

Saving filename: google_transit.zip to uploads/
Extracting uploaded zip to uploads/unzip-150232/
Extracted files: ['agency.txt', 'calendar.txt', 'calendar_dates.txt', 'fare_attributes.txt', 'fare_rules.txt', 'feed_info.txt', 'frequencies.txt', 'routes.txt', 'shapes.txt', 'stops.txt', 'stop_times.txt', 'transfers.txt', 'trips.txt']
Removed .h5 files from db/
GTFS/sequence.json purged.
Commencing conversion of gtfs feed files into the DB's .h5 files
db/agency.h5: 1 rows
db/calendar.h5: 5 rows
db/calendar_dates.h5: 35 rows
db/fare_attributes.h5: 1 rows
db/fare_rules.h5: 5 rows
db/feed_info.h5: 1 rows
db/frequencies.h5: 0 rows
db/routes.h5: 5 rows
db/shapes.h5: 642 rows
db/stops.h5: 111 rows
ERROR:tornado.application:Uncaught exception POST /API/gtfsImportZip?pw=kmrl (::1)
HTTPServerRequest(protocol='http', host='localhost:5000', method='POST', uri='/API/gtfsImportZip?pw=kmrl', version='HTTP/1.1', remote_ip='::1')
Traceback (most recent call last):
  File "pandas/_libs/parsers.pyx", line 1156, in pandas._libs.parsers.TextReader._convert_tokens
TypeError: Cannot cast array from dtype('O') to dtype('int64') according to the rule 'safe'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tornado/web.py", line 1541, in _execute
    result = method(*self.path_args, **self.path_kwargs)
  File "GTFSManager.py", line 645, in post
    if importGTFS(zipname):
  File "<string>", line 145, in importGTFS
  File "/home/nikhil/.local/lib/python3.5/site-packages/pandas/io/parsers.py", line 678, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/nikhil/.local/lib/python3.5/site-packages/pandas/io/parsers.py", line 446, in _read
    data = parser.read(nrows)
  File "/home/nikhil/.local/lib/python3.5/site-packages/pandas/io/parsers.py", line 1036, in read
    ret = self._engine.read(nrows)
  File "/home/nikhil/.local/lib/python3.5/site-packages/pandas/io/parsers.py", line 1848, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 876, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 891, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 968, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 1094, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas/_libs/parsers.pyx", line 1162, in pandas._libs.parsers.TextReader._convert_tokens
ValueError: invalid literal for int() with base 10: ''
ERROR:tornado.access:500 POST /API/gtfsImportZip?pw=kmrl (::1) 3408.06ms
answerquest commented 6 years ago

Hi, sorry for getting back on this late. I figured out the cause of the error:

Pandas cannot read a CSV column as integer if that column contains blank entries.

I tried skirting around that problem by simply excluding those columns from the strict dtypes casting dict, letting pandas read them "naturally".

And that led to another quirk of pandas:

When it's time to write back to CSV, any number that hasn't been explicitly typed as int ends up getting written as 1.0 etc.

And that leads to all these optional flags getting written to CSV as floats, which the GTFS validator flags as an error.
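Both quirks can be reproduced with a minimal snippet (the column names here are just illustrative):

```python
import io
import pandas as pd

# An optional flag column (timepoint) left blank on one row.
csv = "trip_id,timepoint\nT1,1\nT2,\n"

# Quirk 1: a strict int64 dtype fails on the blank entry.
try:
    pd.read_csv(io.StringIO(csv), dtype={"timepoint": "int64"}, na_filter=False)
except ValueError as e:
    print("int64 cast failed:", e)

# Quirk 2: letting pandas infer the column makes it float64 (the NaN
# forces it), so round-tripping writes the flag back out as "1.0".
df = pd.read_csv(io.StringIO(csv))
print(df["timepoint"].dtype)   # float64
print(df.to_csv(index=False))  # the flag comes back as 1.0
```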

So it was just by chance that in all the data I've used so far, the various flag columns were either fully populated or fully blank. But in the GTFS spec, many fields like timepoint can just as well be left blank for some entries. This bug was a serious flaw in the program waiting to light up, and it did in your case.

I tried a whole lot of re-type-casting, and even thought I'd solved it at one point, only to find out the fix was only for display and it was still writing to CSV the same old way.

So the way forward became: those columns need to be read as strings. Pandas is fine with string entries being blank. While doing operations, we can easily cast the individual values to int, or just compare with string '1's and '0's to get the job done. They're just flags, after all.

Given the large number of optional columns in the spec, and the possibility of extra columns introduced by operators for their own internal purposes, the requirement translates to:

When importing a GTFS feed, read all columns as string unless specified otherwise.

I'd actually posted a question about this on Stack Overflow back in April:
https://stackoverflow.com/questions/49684951/pandas-read-csv-dtype-read-all-columns-but-few-as-string
An answer posted there can resolve this issue and also trim the current bloat of the huge GTFS_dtypes dict.

So I'll work on implementing this for the next release.
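The approach from that Stack Overflow answer can be sketched roughly like this: peek at the file's header, then default every column to str except a small set of numeric overrides (the helper name and override columns here are illustrative):

```python
import io
import pandas as pd

# Only the columns that genuinely need numeric dtypes are overridden;
# everything else, including operator-added extra columns, defaults to str.
numeric_overrides = {"shape_pt_lat": "float64", "shape_pt_lon": "float64"}

def read_gtfs_txt(buf):
    # Read just the header row to discover the columns actually present.
    cols = pd.read_csv(buf, nrows=0).columns
    buf.seek(0)
    dtypes = {c: numeric_overrides.get(c, "str") for c in cols}
    # na_filter=False keeps blank entries as '' instead of NaN.
    return pd.read_csv(buf, dtype=dtypes, na_filter=False)

csv = "trip_id,timepoint,extra_operator_col\nT1,1,x\nT2,,y\n"
df = read_gtfs_txt(io.StringIO(csv))
print(df["timepoint"].tolist())  # ['1', ''] - blanks survive as strings
```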

laidig commented 6 years ago

Thanks for the update. I might be able to help you think through these issues if you need help-- I'm on the GTFS slack.


answerquest commented 6 years ago

Fixed with v2.0.0 by type-casting all as str on import.

https://github.com/WRI-Cities/static-GTFS-manager/blob/02aceecc8ec7a00564a0c50833c34aee59b2c3c4/GTFSserverfunctions.py#L151