TravelMapping / HighwayData

Highway Data, including all systems, boundaries, etc.

Newline characters in WPT files #2439

Closed jteresco closed 1 year ago

jteresco commented 5 years ago

We've had some discussion in various GitHub Issues and Pull Requests about newline formats in files, including .wpt files, mostly related to some site update improvements @yakra has been working on.

The vast majority of files in GitHub for the HighwayData repository currently have Unix-style newlines. Even if it doesn't cause problems to have these mixed, I think it's worth trying to be consistent here.

I'm not 100% sure that this statement is correct, but I believe that when my students clone a repository I've populated with files that have Unix-style newlines onto Windows machines, Git converts to DOS-style for them to work with on Windows, then converts them back when they push back to GitHub. That might not happen here on HighwayData because of the mix of files in the origin repository on GitHub. I wonder if we convert everything to Unix-style, we'd find that behavior to eliminate steps that @michihdeu mentions about having to remember to convert manually.

I'm building a file of the current newline formats now - will add it or summarize it here soon.

jteresco commented 5 years ago

OK, here's what I've learned. We have 49869 wpt files in the repository as of this writing.

151 are empty files.

94 are reported as odd file types:

./_boundaries/b_subdiv/b.kaz_aty_man.wpt: AKT archive data
./_boundaries/b_subdiv/b.kaz_aty_zap.wpt: AKT archive data
./_boundaries/b_subdiv/b.kaz_kar_kus.wpt: AKT archive data
./ARM/armm/arm.m006.wpt: Clarion Developer (v2 and above) memo data
./ARM/armm/arm.m007.wpt: Clarion Developer (v2 and above) memo data
./CHN-HE/chng/chnhe.g05n.wpt: Motorola S-Record; binary data in text format
./ENG/gbna/eng.a0034.wpt: Clarion Developer (v2 and above) memo data
./ENG/gbna/eng.a0303.wpt: Clarion Developer (v2 and above) memo data
./ENG/gbna/eng.a3090.wpt: Clarion Developer (v2 and above) memo data
./ENG/gbna/eng.a4032.wpt: Clarion Developer (v2 and above) memo data
./ESP-CT/espa/espct.b022ter.wpt: Clarion Developer (v2 and above) data file, locked, compressed, 980448372 records
./ESP-CT/espa/espct.c015.wpt: Clarion Developer (v2 and above) data file, locked, compressed, 980448372 records
./ESP-CT/espa/espct.c060.wpt: Clarion Developer (v2 and above) data file, compressed, 980448372 records
./ESP-CT/espa/espct.c065.wpt: Clarion Developer (v2 and above) data file, locked, compressed, 980448372 records
./ESP-CT/espct/espct.c014x.wpt: Clarion Developer (v2 and above) data file, locked, compressed, 1886680168 records
./ESP-CT/espct/espct.c015x.wpt: Clarion Developer (v2 and above) data file, locked, compressed, 980448372 records
./ESP-CT/espct/espct.c017x.wpt: Clarion Developer (v2 and above) data file, locked, compressed, 1953785888 records
./ESP-CT/espct/espct.c031d.wpt: Clarion Developer (v2 and above) data file, compressed, 980448372 records
./ESP-CT/espct/espct.c031prax.wpt: Clarion Developer (v2 and above) data file, compressed, 1746938161 records
./ESP-CT/espct/espct.c059.wpt: Clarion Developer (v2 and above) data file, locked, compressed, 980448372 records
./ESP-CT/espct/espct.c060x.wpt: Clarion Developer (v2 and above) data file, compressed, 980448372 records
./ESP-CT/espct/espct.c061.wpt: Clarion Developer (v2 and above) data file, locked, encrypted, compressed, 980448372 records
./ESP-CT/espct/espct.c066.wpt: Clarion Developer (v2 and above) data file, locked, compressed, 980448372 records
./ESP-CT/espn/espct.n002mat.wpt: Clarion Developer (v2 and above) data file, locked, compressed, 980448372 records
./ESP-MD/espa/espmd.a004.wpt: Clarion Developer (v2 and above) memo data
./FRA/eure/fra.e70.wpt: ESP archive data
./FRA/eure/fra.e80.wpt: ESP archive data
./FRA/fraa/fra.a063.wpt: ESP archive data
./HUN/hunf/hun.f032.wpt: Clarion Developer (v2 and above) memo data
./HUN/hunf/hun.f354.wpt: Clarion Developer (v2 and above) memo data
./HUN/hunf/hun.f481.wpt: Clarion Developer (v2 and above) memo data
./ID/usaid/id.id022.wpt: Audio file with ID3 version 2.51.32, extended header, experimental
./ID/usaid/id.id028.wpt: Audio file with ID3 version 2.51.32, extended header, experimental
./ID/usaid/id.id032.wpt: Audio file with ID3 version 2.51.32, extended header, experimental
./ID/usaid/id.id033sprsug.wpt: Audio file with ID3 version 2.51.32, extended header, experimental
./ID/usaid/id.id097.wpt: Audio file with ID3 version 2.32.104, extended header, experimental, footer present
./ID/usaid/id.id099.wpt: Audio file with ID3 version 2.32.104, extended header, experimental, footer present
./IRL/irlr/irl.r149.wpt: Clarion Developer (v2 and above) memo data
./IRL/irlr/irl.r941.wpt: Clarion Developer (v2 and above) memo data
./ITA/ita.nsa (2).wpt: ISO-8859 text
./JPN/jpne/jpn.e067.wpt: , rawbits, bitmap
./JPN/jpne/jpn.e075.wpt: , rawbits, greymap
./KAZ/asiah/kaz.ah061mar.wpt: Clarion Developer (v2 and above) memo data
./KAZ/eure/kaz.e016.wpt: Clarion Developer (v2 and above) memo data
./KAZ/kaza/kaz.a011.wpt: Clarion Developer (v2 and above) memo data
./KAZ/kaza/kaz.a017.wpt: Clarion Developer (v2 and above) memo data
./KAZ/kaza/kaz.a019.wpt: Clarion Developer (v2 and above) memo data
./KAZ/kaza/kaz.a021.wpt: Clarion Developer (v2 and above) memo data
./KAZ/kaza/kaz.a022.wpt: Clarion Developer (v2 and above) memo data
./KAZ/kaza/kaz.a024.wpt: Clarion Developer (v2 and above) memo data
./KAZ/kaza/kaz.a025.wpt: Clarion Developer (v2 and above) memo data
./KAZ/kaza/kaz.a027.wpt: Clarion Developer (v2 and above) memo data
./KAZ/kaza/kaz.a030.wpt: Clarion Developer (v2 and above) memo data
./KAZ/kaza/kaz.a032.wpt: Clarion Developer (v2 and above) memo data
./KS/usaks/ks.ks188.wpt: StuffIt Deluxe Segment (data) : gLim http://www.openstreetmap.org/?lat=39.337627&lon=-100.594779
./LVA/lvap/lva.p039.wpt: , rawbits, bitmap
./MAR/afrtah/mar.tah001.wpt: Dzip archive data, version 65.66
./ME/usame/me.me127.wpt: StuffIt Deluxe Segment (data) : gRd http://www.openstreetmap.org/?lat=43.815116&lon=-69.728755
./MNE/mnem/mne.m006.wpt: Clarion Developer (v2 and above) memo data
./MNE/mnem/mne.m007.wpt: Clarion Developer (v2 and above) memo data
./MNE/mnem/mne.m010.wpt: Clarion Developer (v2 and above) memo data
./MT/usamt/mt.mt013.wpt: MadTracker 2.0 Module MT2
./MT/usamt/mt.mt024.wpt: MadTracker 2.0 Module MT2
./MT/usamt/mt.mt028.wpt: MadTracker 2.0 Module MT2
./MT/usamt/mt.mt056.wpt: MadTracker 2.0 Module MT2
./MT/usamt/mt.mt083.wpt: MadTracker 2.0 Module MT2
./MT/usamt/mt.mt200buslew.wpt: MadTracker 2.0 Module MT2
./MT/usamt/mt.mt200s.wpt: MadTracker 2.0 Module MT2
./MT/usamts/mt.sr212.wpt: MadTracker 2.0 Module MT2
./MT/usamts/mt.sr245.wpt: MadTracker 2.0 Module MT2
./MT/usamts/mt.sr252.wpt: MadTracker 2.0 Module MT2
./MT/usamts/mt.sr341.wpt: MadTracker 2.0 Module MT2
./MT/usamts/mt.sr382.wpt: MadTracker 2.0 Module MT2
./MT/usamts/mt.sr467.wpt: MadTracker 2.0 Module MT2
./MT/usamts/mt.sr470.wpt: MadTracker 2.0 Module MT2
./MT/usamts/mt.sr556.wpt: MadTracker 2.0 Module MT2
./MT/usamts/mt.sr565.wpt: MadTracker 2.0 Module MT2
./NIR/eurtr/nir.moucrbel.wpt: Clarion Developer (v2 and above) memo data
./NIR/nira/nir.a020.wpt: Clarion Developer (v2 and above) memo data
./PAK/pakm/pak.m004.wpt: Clarion Developer (v2 and above) memo data
./PRT/eure/prt.e1.wpt: ESP archive data
./SAU/asimo/sau.m025.wpt: Clarion Developer (v2 and above) memo data
./TKM/asiah/tkm.ah075.wpt: Clarion Developer (v2 and above) memo data
./TKM/asiah/tkm.ah078.wpt: Clarion Developer (v2 and above) memo data
./TKM/cisa/tkm.a388.wpt: Clarion Developer (v2 and above) memo data
./UZB/asiah/uzb.ah063.wpt: Clarion Developer (v2 and above) memo data
./UZB/cisa/uzb.a377.wpt: Clarion Developer (v2 and above) memo data
./UZB/cisa/uzb.a378.wpt: Clarion Developer (v2 and above) memo data
./UZB/cisa/uzb.a379.wpt: Clarion Developer (v2 and above) memo data
./UZB/cisa/uzb.a380.wpt: Clarion Developer (v2 and above) memo data
./UZB/cism/uzb.m034.wpt: Clarion Developer (v2 and above) memo data
./UZB/cism/uzb.m037.wpt: Clarion Developer (v2 and above) memo data
./UZB/eure/uzb.e005.wpt: Clarion Developer (v2 and above) memo data
./UZB/eure/uzb.e007.wpt: Clarion Developer (v2 and above) memo data

I'm guessing most if not all of these are fine, just have some contents that match some pattern in my Mac's /usr/share/file/magic.

1163 of them have CRLF (DOS-style) newlines. Those occur in the following regions and systems:

It's very easy for me to go through and convert them all, but it means that everyone will get a big collection of files with 100% changes next time you sync up with the master. The part of me that likes consistency would like to do this, but I will not do it right away in case there are objections.
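For anyone who wants to reproduce a survey like the one above, here is a minimal Python sketch. It is not the script that produced these numbers. The function names (`newline_style`, `to_unix`, `survey`) are hypothetical, and it assumes the repository is checked out locally with line-ending conversion turned off, so the bytes on disk match what is in the repo.

```python
from pathlib import Path


def newline_style(data: bytes) -> str:
    """Classify raw file contents by the newline convention they use."""
    if not data:
        return "empty"
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf   # bare LF, not part of a CRLF pair
    cr = data.count(b"\r") - crlf   # bare CR, not part of a CRLF pair
    kinds = [name for name, n in (("crlf", crlf), ("lf", lf), ("cr", cr)) if n]
    if not kinds:
        return "none"               # content with no line terminator at all
    return kinds[0] if len(kinds) == 1 else "mixed"


def to_unix(data: bytes) -> bytes:
    """Rewrite CRLF and bare CR terminators as LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def survey(root: str) -> dict[str, int]:
    """Count the .wpt files under root by newline style."""
    counts: dict[str, int] = {}
    for path in Path(root).rglob("*.wpt"):
        style = newline_style(path.read_bytes())
        counts[style] = counts.get(style, 0) + 1
    return counts
```

Calling `survey(".")` from the top of a checkout would give the per-style counts. A bulk conversion would write `to_unix(path.read_bytes())` back with `path.write_bytes(...)`, which is what makes every converted file show up as fully changed.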

yakra commented 5 years ago

> I'm guessing most if not all of these are fine, just have some contents that match some pattern in my Mac's /usr/share/file/magic.

Yup. The first few characters matching a pattern of characters used to identify a specific file format. M3, ID3, etc... I immediately thought how if I saved a WPT file beginning with "IMPM", it would be recognized as "Impulse Tracker audio (audio/x-it)" -- and when I saw "MadTracker 2.0 Module MT2", it made me smile. :)
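A rough Python sketch of that kind of prefix matching, to show why an ordinary text file can be misreported. The two signatures are the ones mentioned in this comment; the table and the `guess_type` function are illustrative assumptions, not the real magic database or the logic of the `file` utility.

```python
# Illustrative signatures only; a real magic database is far larger
# and supports much more than fixed prefixes.
SIGNATURES = {
    b"ID3": "Audio file with ID3 tag",
    b"IMPM": "Impulse Tracker audio",
}


def guess_type(data: bytes) -> str:
    """Return a description based only on the first few bytes of the data."""
    for magic, description in SIGNATURES.items():
        if data.startswith(magic):
            return description
    return "text"
```

With this, any file whose first waypoint label happens to start with "ID3" is reported as audio. The version numbers in the listing above look consistent with that: 51 and 32 are the ASCII codes for "3" and a space, so "ID3 version 2.51.32" is presumably a file that begins with the label ID33 followed by a space.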

I'm not so hot on the idea of converting, even if I would like consistency in my own files. Thinking primarily of @ajfroggie here; I don't want to have to create a confusing extra step for him before committing his files. Contributors should just be able to commit their files and be done with it, and not have to worry about extra technical rigamarole.

I just see it causing problems as Windows users create files on their Windows systems that, naturally, have DOS style newlines, and the changes are undone.

I much prefer pursuing a robustness solution, being sure that siteupdate can cope with whatever kind of newlines it's fed.
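As a sketch of what I mean by coping with any newline style, and not what siteupdate actually does: the hypothetical `waypoint_lines` function below reads the file contents as bytes and lets `bytes.splitlines()` handle LF, CRLF and bare CR terminators alike. Dropping blank lines and decoding as UTF-8 are assumptions made for the example.

```python
def waypoint_lines(data: bytes) -> list[str]:
    """Split raw file contents into non-blank lines, whatever the newline style."""
    # bytes.splitlines() breaks at \n, \r\n and \r, and omits the terminators.
    return [raw.decode("utf-8") for raw in data.splitlines() if raw.strip()]
```

A caller would pass in the result of `open(path, "rb").read()`, so no contributor has to convert anything before committing.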

> I'm not 100% sure that this statement is correct, but I believe that when my students clone a repository I've populated with files that have Unix-style newlines onto Windows machines, Git converts to DOS-style for them to work with on Windows,

FWIW, DOS-style newlines are not converted when pulling into my Linux machine.

> then converts them back when they push back to GitHub.

I see DOS-style staying DOS-style, if usaal is any indication.

> That might not happen here on HighwayData because of the mix of files in the origin repository on GitHub. I wonder if we convert everything to Unix-style,

IIUC, you're saying it may be because not all files in the source repo have a uniform newline style? That just seems so... counterintuitive. I'm skeptical that Git (Hub?) would first check whether all 49869+ files in the repo have the same kind of newline encoding before translating on upload. (Unless maybe there's some variable that stores whether they're all consistent or mixed, but... why?)

How about the files that were first grabbed from CHM? Were these consistently Unix? Or mixed?
https://github.com/TravelMapping/HighwayData/tree/b01d3e69cbe3c6a6a8721ce298e042321270533d/chm_final
https://github.com/TravelMapping/HighwayData/tree/8ae94ff93bf787417a7d5520e36a3352a74f300d/hwy_data
If they were consistent, then something got committed with a CRLF that didn't get converted.

> we'd find that behavior to eliminate steps that @michihdeu mentions about having to remember to convert manually.

I think @michihdeu started doing that in the "old days" (of Mac -> blizzard updates) when it mattered more. @ajfroggie doesn't do this now, and I have not seen any ill effects from usaal.

michihdeu commented 5 years ago

RTFM? https://help.github.com/articles/dealing-with-line-endings/

You can switch the OS at the top of that page. I don't understand what to do, though. I don't enter commands; I use GitHub Desktop. I guess a .gitattributes file should do it, but I tried this about 2 years ago and failed.

If memory serves, I've created a file with text eol=lf

yakra commented 5 years ago

> I much prefer pursuing a robustness solution, being sure that siteupdate can cope with whatever kind of newlines it's fed.

This goes for the C++ translation as well. I'd like to keep a variety of file types around.

michihdeu commented 5 years ago

I've submitted a L44 with CRLF. datacheck.sh had no problem and Git shows it right.

michihdeu commented 5 years ago

GA/usaga has all files with CRLF.

I've edited two files and removed the CRs, but Git shows only one of them as totally changed. The other one is shown as usual: https://github.com/TravelMapping/HighwayData/pull/2449/commits/39aa1eb16c2c2c98e9df06b8a191287c63740489. Data check had no problem.

yakra commented 5 years ago

GA/usai/ga.i075.wpt already had Unix-style line feeds.

michihdeu commented 5 years ago

Ok, my bad. I-75 is GA but not usaga.

michihdeu commented 1 year ago

Is this still relevant? If not, can we close it?

yakra commented 1 year ago

I'm fine with this being closed.

jteresco commented 1 year ago

We seem to be getting along fine doing what we're doing now. I don't think our highway data managers who submit updates directly through GitHub have caused a problem with this in a long time. I do still need to do some manual CRLF style conversions on files that are emailed to me by some others, as well as fixing up permissions.

So as much as I'd like full consistency, I think any change here is not worth anyone's time.