mismatch between legofit file names in bepe vs flatfile.py

janxkoci commented 1 year ago

Hi Alan,

I keep running into an inconsistency between C tools like bepe and Python tools like flatfile.py. It comes up a lot when I try to run booma, but it may appear elsewhere too (I will let you know if I find another example).

Basically, I keep model outputs in separate subfolders and collect results and summaries from the parent folder (as I usually collect these for multiple models at a time, so doing it from parent folder is more convenient).

The problem arises because tools like bepe include plain filenames within their output, while the Python tools (like flatfile.py) also include paths of the input files. This leads to booma errors like this one:

booma.c:187: file name mismatch: 1Ac_1_data.legofit != hcom3_atgc/1Ac/1Ac_1_data.legofit
booma.c:948: mismatch between legofit file names in hcom3_atgc/1Ac_1.bepe and hcom3_atgc/1Ac_1.flatfile
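
To illustrate the mismatch, here is a minimal Python sketch using the two names from the error message above. It assumes that booma compares the names as exact strings, which is only inferred from the error text and not checked against booma.c; the variable names are made up for the example.

```python
import os.path

# The two names reported in the booma error message.
name_in_bepe = "1Ac_1_data.legofit"
name_in_flatfile = "hcom3_atgc/1Ac/1Ac_1_data.legofit"

# An exact string comparison fails, because one name carries a directory prefix.
exact_match = name_in_bepe == name_in_flatfile

# Comparing only the base names removes the difference.
basename_match = os.path.basename(name_in_bepe) == os.path.basename(name_in_flatfile)

print(exact_match, basename_match)
```

So the two files describe the same legofit output and differ only in whether the directory part is kept.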

I think it's easier to fix this in the Python tools, so I'm guessing this would be the code to look at first: https://github.com/alanrogers/legofit/blob/8cbbaaa7c744a42a2434203dec82cc94e5f0e0f5/src/flatfile.py#L122-L125

Thanks for considering this tweak.

alanrogers commented 1 year ago

In order to fix this, I need to understand the directory structure that you're using.

In my own work, each data set has its own directory, which contains the main data file (data.opf), a "boot" subdirectory for all the bootstrap replicates, and a subdirectory for each model that I fit to the data. The directory for each model contains (among other things), the .legofit files, the .bepe file, and the .flat file. To create the .bepe and .flat files, I "cd" into the directory for the relevant model and run "bepe" and "flatfile.py". Because all the .legofit files are local to that directory, the .bepe and .flat files do not end up with pathnames containing "/" characters.

In order to make this work for you, I need to understand your work flow and how your files are organized into directories and subdirectories.

alanrogers commented 1 year ago

If I understand correctly, you must have run "bepe" and "flatfile.py" from different directories. Is that necessary?

janxkoci commented 1 year ago

> you must have run "bepe" and "flatfile.py" from different directories

No, actually, that is the problem: I run them from the same parent directory, but the Python tools include relative paths to input files, while the C tools only include base names of the input files. This is really the only problem.

Currently, I have one parent folder with lgo model files and scripts, and subfolders for different datasets. Within each dataset folder there are subfolders for input data (e.g. hcom3_atgc/data/data.opf and hcom3_atgc/data/boot*.opf) and model outputs (hcom3_atgc/1A/*.state and hcom3_atgc/1A/*.legofit).

Then I have a bash script that takes the relative path as argument and collects all info I am interested in:

#!/bin/bash

## USAGE
# bash collect_model.sh datafilter/model stage

## READ ARGS
modeldir=$1   # hcom3_tv/1Bc/
stage=${2:-1} # default=1

# Split the relative path into dataset and model names
data=$(echo "$modeldir" | tr "/" "\t" | cut -f 1)   # hcom3_tv
model=$(echo "$modeldir" | tr "/" "\t" | cut -f 2)  # 1Bc

datadir="${data}/data"  # hcom3_tv/data

# Prefixes of the input (.legofit) and output (summary) files
input="${modeldir}/${model}_${stage}"
output="${data}/${model}_${stage}"

# Variables are quoted; the globs (boot*) stay outside the quotes so they still expand
bepe \
    "${datadir}/data.opf" \
    "${datadir}"/boot*.opf \
    -L "${input}_data.legofit" \
    "${input}"_boot*.legofit > "${output}.bepe"

resid \
    "${datadir}/data.opf" \
    "${datadir}"/boot*.opf \
    -L "${input}_data.legofit" \
    "${input}"_boot*.legofit > "${output}.resid"

flatfile.py \
    "${input}_data.legofit" \
    "${input}"_boot*.legofit > "${output}.flatfile"

bootci.py \
    "${output}.flatfile" > "${output}.bootci"

The files for bepe, flat, bootci etc are then in the dataset subfolders and named based on the model, e.g. hcom3_atgc/1A_1.bepe.

janxkoci commented 1 year ago

> To create the .bepe and .flat files, I "cd" into the directory for the relevant model and run "bepe" and "flatfile.py". Because all the .legofit files are local to that directory, the .bepe and .flat files do not end up with pathnames containing "/" characters.

I can try to include the cd step before I run flatfile.py (and cd ../ back afterwards). But this may not be intuitive to other users, so fixing the inconsistency may be a better approach. Note that bepe and other C tools don't need this tweak, because they omit paths to input files in their output.
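
For reference, the cd workaround could also be scripted without changing the caller's directory. The following is a rough Python sketch, not part of legofit: `run_in_dir` is a hypothetical helper that expands the file patterns inside the model folder, passes only bare file names to the command, and runs it with that folder as its working directory. The output path is still resolved from the parent folder.

```python
import subprocess
from pathlib import Path


def run_in_dir(command, workdir, patterns, outfile):
    """Run `command` inside `workdir` on files matching `patterns`.

    Hypothetical helper: stdout is saved to `outfile`, and the list of
    bare file names passed to the command is returned.
    """
    workdir = Path(workdir)
    # Expand each pattern relative to workdir, keeping only the file names,
    # so that no argument contains a "/" character.
    names = [p.name for pat in patterns for p in sorted(workdir.glob(pat))]
    with open(outfile, "w") as out:
        # cwd= plays the role of "cd workdir" for the child process only.
        subprocess.run([*command, *names], cwd=workdir, stdout=out, check=True)
    return names


# Possible usage from the parent folder (assumes flatfile.py is on the PATH):
# run_in_dir(["flatfile.py"], "hcom3_atgc/1A",
#            ["1A_1_data.legofit", "1A_1_boot*.legofit"],
#            "hcom3_atgc/1A_1.flatfile")
```

This keeps the collection script in the parent folder, but it is still a workaround that every user would have to know about.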

I can also show the folder structure with Miller columns:

[Image: screenshot taken 2023-08-29 14 20 48]

Here, I have a parent folder with lgo files and scripts to submit job arrays and collect results, and subfolders for different datasets (left column), each containing data/{data,boot{0..49}}.opf and results summarized with bepe, resid, flatfile.py and so on (middle column), and finally model outputs themselves in subfolders for each model (right column).

alanrogers commented 1 year ago

Sorry to have been so slow on this.

I've now changed flatfile.py so that it prints basenames rather than pathnames. The new code is in the devlp branch. If it works for you, I'll merge to master.
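
The idea behind the change can be sketched as follows. This is only an illustration, not the code in the devlp branch: `display_names` is a hypothetical helper, and the paths are examples based on the layout described earlier in this thread.

```python
import os.path


def display_names(paths):
    """Return the base name of each input path (hypothetical helper)."""
    return [os.path.basename(p) for p in paths]


# Example paths as they might be passed on the command line from a parent folder.
inputs = [
    "hcom3_atgc/1A/1A_1_data.legofit",
    "hcom3_atgc/1A/1A_1_boot0.legofit",
]

for name in display_names(inputs):
    print(name)
```

Printed this way, the names no longer depend on the directory the tool was run from. The sketch does not handle inputs from different directories that share a base name.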

janxkoci commented 1 year ago

Thanks so much - booma now accepts bepe and flatfiles created from the same parent directory without problems!

alanrogers / legofit

mismatch between legofit file names in bepe vs flatfile.py #15