MolSSI-Education / python_scripting_cms

Python Data and Scripting course for computational chemists
https://molssi-education.github.io/python_scripting_cms

Strategy for getting files in episode 3 #5

Closed janash closed 5 years ago

janash commented 5 years ago

In the current lesson (03-multiple_files), we use the glob module. However, Python 3.4 and above includes the pathlib library, which is arguably a "better practice". Its Path objects have a glob method and can also be used to manipulate file paths.

In this case, instead of this code block:

import glob
filenames = glob.glob('outfiles/*.out')
print(filenames)

We would have

from pathlib import Path
current_dir = Path('.', 'outfiles')
filenames = current_dir.glob('*.out')
# glob returns a generator, so convert it to a list before printing
print(list(filenames))

This addresses a second issue with the current code. The line filenames = glob.glob('outfiles/*.out') hard-codes the / character as the path separator, which is not the native separator on Windows (Windows uses a \). Before the pathlib library, this would have been fixed using os.path. The code block using pathlib avoids the issue because the path is built from its parts rather than from a hard-coded string.
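As a rough illustration of the path manipulation mentioned above, here is a minimal sketch. It assumes a throwaway temporary directory and made-up file names (ethanol.out, methanol.out, notes.txt) so that it runs anywhere; none of these names are taken from the lesson.

```python
import tempfile
from pathlib import Path

# Build a throwaway directory so the sketch runs on any machine.
# The file names below are hypothetical, chosen only for illustration.
with tempfile.TemporaryDirectory() as tmp:
    outfiles = Path(tmp) / "outfiles"  # "/" joins parts with the OS separator
    outfiles.mkdir()
    for name in ("ethanol.out", "methanol.out", "notes.txt"):
        (outfiles / name).touch()

    # Path.glob yields Path objects lazily; sort them for a stable order.
    found = sorted(outfiles.glob("*.out"))
    names = [p.name for p in found]  # file name without the directory
    stems = [p.stem for p in found]  # file name without the extension

print(names)
print(stems)
```

The / operator on Path objects does the job of os.path.join, and attributes such as name and stem replace manual string splitting.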

We may consider having a section on working with file paths, as I imagine that many of the attendees will use Windows. It is also very common for people to hard-code file paths, which can lead to problems later.

Read more here: http://blog.danwin.com/using-python-3-pathlib-for-managing-filenames-and-directories/

janash commented 5 years ago

The other option is to build the path with os.path.join and keep using glob:

import os
import glob

filenames = glob.glob(os.path.join('outfiles', '*.out'))
print(filenames)

We could split the filenames line into multiple lines, as it may be confusing. Using os and glob is more commonly seen than using pathlib, and is compatible with Python 2.
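One way the split could look is sketched below. The variable name file_location and the files a.out and b.out are placeholders, and the temporary directory exists only so that the sketch is self-contained.

```python
import glob
import os
import tempfile

# Create a temporary "outfiles" folder with two hypothetical output files.
with tempfile.TemporaryDirectory() as tmp:
    outdir = os.path.join(tmp, "outfiles")
    os.mkdir(outdir)
    for name in ("a.out", "b.out"):
        open(os.path.join(outdir, name), "w").close()

    # Step 1: build the search pattern with the separator for this OS.
    file_location = os.path.join(outdir, "*.out")
    # Step 2: ask glob for every file that matches the pattern.
    filenames = sorted(glob.glob(file_location))
    # Keep only the file names so the result does not depend on the temp path.
    basenames = [os.path.basename(f) for f in filenames]

print(basenames)
```

Separating the two steps lets learners print file_location and see the pattern before it is passed to glob.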

armcdona commented 5 years ago

I can see the merit of doing it either way. Most of them will be doing a fresh install of Anaconda, so they will be getting Python 3.7, and we could teach them the pathlib way. That seems better to me, but only nominally so.

janash commented 5 years ago

Right, it doesn't seem to be too different. The pathlib way has one import instead of two. Honestly, I still don't know the full capabilities of pathlib. I've just been reading some blog posts lately that recommend it.

If they're looking at code others have written in the past, they are much more likely to see glob. For the file paths, it's pretty common for people to hard-code the file path, but using os (or pathlib) is better practice because the same code works across different operating systems.

I propose having a short module or section on working with file paths before the file parsing section. We should also talk about file paths/locations in the pre-workshop tutorials (what pwd does, for example).
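A file paths section might start from something like the sketch below, which ties the shell's pwd to Python and contrasts relative with absolute paths. The path outfiles/ethanol.out is a hypothetical example and does not need to exist for the code to run.

```python
import os
from pathlib import Path

# Path.cwd() reports the current working directory, like pwd in the shell.
here = Path.cwd()

# A relative path is interpreted starting from the working directory.
# "outfiles" and "ethanol.out" are hypothetical names for illustration.
relative = Path("outfiles") / "ethanol.out"
absolute = here / relative

print(here)
print(relative.is_absolute())
print(absolute.is_absolute())
print(relative.parts)

# str() renders the path with the native separator for this OS,
# matching what os.path.join produces.
same_as_os_join = str(relative) == os.path.join("outfiles", "ethanol.out")
print(same_as_os_join)
```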