EarthCubeInGeo / resen

GNU General Public License v3.0
7 stars 3 forks source link

Searching only *.json files to address issue #89 and #78 #90

Closed pmreyes2 closed 3 years ago

pmreyes2 commented 3 years ago

This adds a general exception to handle possible file problems in the files used for specifying resen-core images inside folder ~/.config/resen/cores. The issue is reported in #78 and in #89 As an example, one of those problems arose when a file was opened with the vim editor which creates a swap file with non-utf-8 characters in it which caused resen to crash at the start. With this fix, a warning is issued about the file that couldn't be read and resen start with no problems.

asreimer commented 3 years ago

Can we instead implement more rigorous file grabbing? The fundamental problem causing #89 is that the code is not actually only grabbing JSON files.

asreimer commented 3 years ago

More specifically, this line is the problem: https://github.com/EarthCubeInGeo/resen/blob/432c2c3f1a4b8b944691c617b802c23af36635e0/resen/Resen.py#L866

One solution is to use regex to prune the output file list of the os.listdir. Something like ([^.]*.json) should work, making sure that the whole string is matched.

pmreyes2 commented 3 years ago

@asreimer I agree. So only .json or .JSON files will be considered? or we use something like .upper() to compare only capital letters.

valentic commented 3 years ago

Which is why glob is the right answer....

On 7/13/21 2:07 PM, Ashton Reimer wrote:

@.**** commented on this pull request.


In resen/Resen.py https://github.com/EarthCubeInGeo/resen/pull/90#discussion_r669111310:

@@ -864,11 +864,17 @@ def __get_valid_cores(self):

for each JSON file in core directory, read in list of cores

cores = [] for fn in os.listdir(core_dir): +

  • if not fn.lower().endswith('json'):

This allows hidden files (those with a |.| in front) to be picked up. I don't agree that this should be acceptable. I think we can test for this with |startswith('.')|.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/EarthCubeInGeo/resen/pull/90#pullrequestreview-705648744, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAMC5ZVLKWPAXBGEPFRSJO3TXSTJNANCNFSM5AJV3OLA.

pmreyes2 commented 3 years ago

The latest commit changes the code to use glob as a way to search for resen-core images specification files.

valentic commented 3 years ago

I think the PR code is really hard to read and figure out what is going on:

for fn in glob.glob1(core_dir,"*.[jJ][sS][oO][nN]"):
    try:
        with open(os.path.join(core_dir,fn),'r') as f:
            cores.extend(json.load(f))
     except json.decoder.JSONDecodeError:
         print('WARNING: {} is not a valid JSON file! Skipping this file.'.format(fn))
     except:
         print('WARNING: {} can not be read. Skipping this file.'.format(fn))

There are a number of things wrong with the PR for this one. First off, there is no glob1() function in the glob module. The glob() function takes a single argument which is the pattern (not a path and filename). The pattern is a simple shell-style wildcard match, not a regular expression so the convoluted [jJ][sS]... pattern wouldn't work anyways.

I still think that the best approach is the specify that the configuration files we are looking for simply have the lower case .json extension. That is the normal convention for Unix systems and Windows filesystems are case insensitive so it doesn't matter there.

The other problem is that the open() call adds the core_dir path back in, but will already be there in the output of the glob call(). The default mode for open is 'r', so no need to specify (again, with the goal if reducing the number of symbols you need to read when looking at the code).

I would also reduce the except's to a single catch all that just notifies us that the load failed. I'm also a fan of using formatted string literals to make the code easier to read.

So the cleaned up version of the code would look like:

for filename in glob.glob(os.path.join(core_dir, '*.json')):
    try:
        with open(filename) as f:
            cores.extend(json.load(f))
    except:
        print('WARNING: Problem reading {filename}. Skipping.')

If you really wanted to read both .json and .JSON file patterns, then build them up as separate lists:

json_files = glob.glob(os.path.join(core_dir, '*.json'))
JSON_files = glob.glob(os.path.join(core_dir, '*.JSON'))

for filename in [*json_files, *JSON_files]:
...
pmreyes2 commented 3 years ago

@valentic I've followed your suggestions and I agree that those modifications make the code easier to read. There is a difference in the warning message though. This time the full path of the file is given, in all the previous versions of this code, only the file base name was part of the warning message. I prefer this version that prints the full path since it gives more information to the user on where to find the offending file. BTW, glob.glob1 does exist but it returns only filenames without directory. I was using it before in order not to touch the open code that was joining the file with the directory. I am using now glob.glob that outputs the full path.

valentic commented 3 years ago

glob1 is not part of the public interface for the glob module. See the python docs or help(glob). Writing code that uses it can potentially break because it is a private implementation detail. Its use in other python modules has been considered a bug:

https://bugs.python.org/issue16620

On 7/14/21 12:45 PM, Pablo M. Reyes wrote:

@valentic https://github.com/valentic I've followed your suggestions and I agree that those modifications make the code easier to read. There is a difference in the warning message though. This time the full path of the file is given, in all the previous versions of this code, only the file base name was part of the warning message. I prefer this version that prints the full path since it gives more information to the user on where to find the offending file. BTW, |glob.glob1| does exist but it returns only filenames without directory. I was using it before in order not to touch the |open| code that was joining the file with the directory. I am using now |glob.glob| that outputs the full path.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/EarthCubeInGeo/resen/pull/90#issuecomment-880161187, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAMC5ZTBTK4CQPBA4B7DGTTTXXSNBANCNFSM5AJV3OLA.

pmreyes2 commented 3 years ago

I believe this is the final iteration of fixing this code.