Data reorganization - Githubissues

acviana commented 10 years ago

Some data is currently still being transferred so wait till I confirm all the data has arrived before implementing these changes. You can start figuring out the process in the meantime though.

The MAST data archive at STScI has provided us a complete data dump of the ACS and WFC3 moving target data. These data are in /astro/mtpipeline/archive/. Our WFPC2 data are currently in /astro/3/mutchler/mt/archive/. The IT division has provided us with a larger storage area under /astro/mtpipeline/ to accommodate all the data products we will be producing.

The goal is to transfer all the data from all 3 instruments into this area so that it looks like this:

.
|--- mtpipeline/
|    |--- archive/
|    |    |--- acs/
|    |    |    |--- <proposal_id>_<target_name>/
|    |    |    |--- ...
|    |    |--- wfc3/
|    |    |    |--- <proposal_id>_<target_name>/
|    |    |    |--- ...
|    |    |--- wfpc2/
|    |    |    |--- <proposal_id>_<target_name>/
|    |    |    |--- ...
|    |--- mtpipeline_outputs/
|    |    |--- acs/
|    |    |    |--- <proposal_id>_<target_name>/
|    |    |    |--- ...
|    |    |--- wfc3/
|    |    |    |--- <proposal_id>_<target_name>/
|    |    |    |--- ...
|    |    |--- wfpc2/
|    |    |    |--- <proposal_id>_<target_name>/
|    |    |    |--- ...
|    |--- logs/

For the WFPC2 instrument this involves:

using mv to transfer the archive/ folder
using mv to transfer the drizzled/ folder to a mtpipeline_outputs/ folder
using mv to transfer the logs/ folder

For the ACS and WFC3 instruments this involves:

Writing a Python script to go through each FLT file
Open the header with astropy
Get the proposid and targname keywords
Create a folder called <proposal_id>_<target_name>/ if it doesn't already exist
Copy the RAW, FLT, and SPT file into the appropriate folder.
Set all the group and permissions settings

acviana commented 10 years ago

All the data has been delivered so you can start working on this now.

walyssonBarbosa commented 10 years ago

So this is what I need to do about WFPC2:

Move /astro/3/mutchler/mt/archive/wfpc2 to /astro/mtpipeline/archive/wfpc2
Move /astro/3/mutchler/mt/drizzled/ to /astro/mtpipeline/mtpipeline_outputs/wfpc2
Move /astro/3/mutchler/mt/logs/ to /astro/mtpipeline/logs/

Is that correct?

After this I will need to change some paths in the scripts, right?

acviana commented 10 years ago

Let's cp instead of mv just to be safe :)

And yes, we will have to change some paths. Hopefully, those paths are in your settings.yaml file and not hard-coded into the scripts, but if they are hard-coded we can replace them with references to the YAML-generated SETTINGS dictionary.

walyssonBarbosa commented 10 years ago

After I've copied the files, what should I do with the old ones?

I've already copied the logs folder and changed the settings.yaml file. I'm still waiting for the system to copy the input and output files.

acviana commented 10 years ago

Let's leave the old ones for now, we can delete them later if we need to.

walyssonBarbosa commented 10 years ago

This is the script I implemented to organize the files:

import glob
import os
import sys
import shutil
from astropy.io import fits
from mtpipeline.get_settings import SETTINGS

def organize_filetree(path):
    """
        Organizes the new files, copying them to their new folders.

        Parameters:
            input: path
                The folder in which the files will be organized.

        Returns:
            nothing

        Output:
            nothing
    """
    all_files_list = glob.glob(os.path.join(path, '*.fits'))
    flt_file_list = [filename for filename
                     in all_files_list
                     if filename.split('_')[-1] == 'flt.fits']
    flt_file_list = set(flt_file_list)
    all_files_list = set(all_files_list)
    filetree_dict = {}
    for flt_file in flt_file_list:
        hdulist = fits.open(flt_file)
        new_folder = str(hdulist[0].header['PROPOSID']) + '_' + hdulist[0].header['TARGNAME'] + '/'
        basename = flt_file.split('/')[-1].split('_')[0]
        if not os.path.exists(os.path.join(path, new_folder)):
            os.mkdir(os.path.join(path, new_folder))
        # Record the folder for every rootname, not only when the folder is new.
        filetree_dict[basename] = new_folder

    for file in all_files_list:
        basename = file.split('/')[-1].split('_')[0]
        shutil.copy2(file, path + filetree_dict[basename])

if __name__ == '__main__':
    organize_filetree(SETTINGS['wfc3_input_path'])
    organize_filetree(SETTINGS['acs_input_path'])

walyssonBarbosa commented 10 years ago

It seems to be working. The folders are being created and the files will be copied to them next.

I'll just need to set all the group and permissions settings.

walyssonBarbosa commented 10 years ago

On Saturday I passed by ST to check if the process of copying the files had been completed, but it was still running.

Today only the input folder has been copied; the output one is still in progress because the laptop disconnected from the internet yesterday. WFC3 and ACS are almost finished. The script had a few errors, but I think it's correct now.

acviana commented 10 years ago

Great! I'm very impressed with your script. I have a couple of comments but overall, great work:

    flt_file_list = set(flt_file_list)
    all_files_list = set(all_files_list)

Why do you (1) create sets here and (2) name the variable _list when it's a set and not a list? Sets are great for membership testing (since they are basically a hash) but you're not doing that here.
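To illustrate the difference, here is a minimal sketch. The rootnames are just example values and the variable names are made up for this example. A set drops duplicates and makes `in` checks fast, but for a single pass over the items a plain list is enough.

```python
# Example values only; any list of strings would do.
rootname_list = ['ibxa62ojq', 'ibtp19i6q', 'ibxa62ojq']

# A set drops duplicates and gives fast membership tests.
rootname_set = set(rootname_list)
has_rootname = 'ibtp19i6q' in rootname_set

# For a single pass over the items, the list itself is fine.
lengths = [len(rootname) for rootname in rootname_list]

print(len(rootname_set), has_rootname)
```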

        hdulist = fits.open(flt_file)

So this will work, but there is a preferred method of dealing with file objects in Python: the with statement. This blog gives a good explanation: http://effbot.org/zone/python-with-statement.htm. So in the future I would write something more like this:

with fits.open(flt_file) as hdulist:
    for hdu in hdulist:
        do_science(hdu)
# Once we leave the with block, the file object is automatically closed

If you want to extract the last part of a path, whether that's a folder or a filename, you can use os.path.basename, so I would replace the first part of this line with that:

basename = flt_file.split('/')[-1].split('_')[0]

Also, for future reference in the codebase, refer to u2mi0101t_c0m.fits as the basename and u2mi0101t as the rootname.
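As a small sketch of that naming convention, assuming a made-up directory for the example file (split_names is a hypothetical helper, not something in the codebase):

```python
import os


def split_names(path):
    """Return (basename, rootname) for a FITS file path."""
    # The basename is the filename without any leading directories.
    basename = os.path.basename(path)
    # The rootname is everything before the first underscore.
    rootname = basename.split('_')[0]
    return basename, rootname


# The directory here is only an example location.
basename, rootname = split_names('/some/folder/u2mi0101t_c0m.fits')
print(basename)
print(rootname)
```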

If you are combining objects into a string and need to format them, consider using the format method. Also, if you are going to use os.path.join you don't have to do things like adding trailing slashes, which is one of the advantages of using it in the first place.

new_folder = str(hdulist[0].header['PROPOSID']) + '_' + hdulist[0].header['TARGNAME'] + '/'

So I would rewrite this to be something like this:

new_folder = '{}_{}'.format(hdulist[0].header['PROPOSID'], hdulist[0].header['TARGNAME'])

Here I would add os.path.join and avoid using file as a variable name, because it is a built-in in Python 2: https://docs.python.org/2/library/functions.html#file. Yes, Python is crazy like that, by which I mean dynamically typed, and it will let you overwrite an existing name, even if it is a built-in part of the language.

        shutil.copy2(file, path + filetree_dict[basename])
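A short sketch of why os.path.join is safer than plain concatenation here. The folder names are example values, the variable names are made up, and the expected results assume POSIX-style paths:

```python
import os

# Example values; note that the path has no trailing slash.
path = '/astro/mtpipeline/archive/wfc3'
new_folder = '12801_PLUTO-CHARON-3RD-QUADRANT'

# Plain concatenation silently glues the two names together.
naive_destination = path + new_folder

# os.path.join inserts the separator only where it is needed.
destination = os.path.join(path, new_folder)
same_destination = os.path.join(path + '/', new_folder)

print(naive_destination)
print(destination)
```

With a name like fits_file instead of file, the copy then becomes shutil.copy2(fits_file, destination).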

walyssonBarbosa commented 10 years ago

WFPC2, WFC3 and ACS are organized. However, I haven't deleted the old files yet.

I set the WFPC2 outputs to copy in the Virtual Machine. There are still 144 folders to be copied.

The pipeline is running on 05836_saturn. There are still about 400 png files missing there.

walyssonBarbosa commented 10 years ago

All files were copied.

Does /astro/3/mutchler/mt/wfpc2/tmp/ contain anything important? I don't think I have permission to access this folder, and I don't know if it was copied correctly.

acviana commented 10 years ago

I updated the group ownership with the following commands:

$ pwd
/astro/mtpipeline/archive
$ find . -name '*.fits' | xargs chgrp -v mtpipe
$ find . -type d | xargs chgrp -v mtpipe

walyssonBarbosa commented 10 years ago

I used the following to update /astro/mtpipeline/mtpipeline_outputs/wfpc2:

$ chgrp -R mtpipe /astro/mtpipeline/mtpipeline_outputs/

I tried to change the file permissions in /astro/mtpipeline/archive but I don't have permission to do that.

acviana commented 10 years ago

These groups are all correct; it's just the wfpc2 folder that needs to be set to 775.

$ ls -l
total 5366
drwxrwxr-x  492 viana     STSCI\mtpipe  1829888 Jun 14 21:29 acs
drwxrwxr-x  298 viana     STSCI\mtpipe   907264 Jun 13 16:33 wfc3
drwxr-xr-x  182 wbarbosa  STSCI\mtpipe    10240 Jun 15 09:21 wfpc2
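For reference, the same change can be made from Python. This is only a sketch: it uses a temporary directory as a stand-in for the wfpc2 folder, and make_group_writable is a hypothetical helper name. On the real folder the shell equivalent is chmod 775 wfpc2.

```python
import os
import stat
import tempfile


def make_group_writable(directory):
    """Set a directory to mode 775 and return the resulting permission bits."""
    os.chmod(directory, 0o775)
    return stat.S_IMODE(os.stat(directory).st_mode)


# Temporary stand-in for the wfpc2 folder, starting at 755 (drwxr-xr-x).
demo_dir = tempfile.mkdtemp()
os.chmod(demo_dir, 0o755)

mode = make_group_writable(demo_dir)
listing = stat.filemode(os.stat(demo_dir).st_mode)
print(oct(mode), listing)
```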

walyssonBarbosa commented 10 years ago

Now it's set to 775:

$ ls -l /astro/mtpipeline/archive/
total 5366
drwxrwxr-x  492 viana     STSCI\mtpipe  1829888 Jun 14 21:29 acs
drwxrwxr-x  298 viana     STSCI\mtpipe   907264 Jun 13 16:33 wfc3
drwxrwxr-x  182 wbarbosa  STSCI\mtpipe    10240 Jun 15 09:21 wfpc2
$ ls -l /astro/mtpipeline/mtpipeline_outputs/
total 22
drwxrwxr-x  191 wbarbosa  STSCI\mtpipe  11264 Jun 17 02:23 wfpc2

acviana commented 10 years ago

Great, let me know when all the files are copied over as well.

walyssonBarbosa commented 10 years ago

I think all the files were copied. Now we just need to complete running the pipeline.

acviana commented 10 years ago

Is the pipeline running in the old /astro/3/... path or in /astro/mtpipeline/...? Can you use the log file to give me a rough estimate of how long it should take for the pipeline to run to completion?

walyssonBarbosa commented 10 years ago

It's running in /astro/mtpipeline/....

From 10am to 12pm today, the pipeline created around 3600 files. A rough estimate is that the pipeline will take 130 hours to complete creating all the remaining 172600 files.
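As a rough cross-check, the sketch below does a straight linear extrapolation from the rounded figures above, assuming the rate stays constant (the variable names are illustrative). With these rounded inputs it gives about 96 hours, which is lower than the 130 hours quoted, so that figure may reflect exact log counts or an allowance for slower stretches.

```python
# Rounded figures from the log: files created in a two-hour window.
files_created = 3600
window_hours = 2.0
files_remaining = 172600

# Assume the pipeline keeps producing files at the same rate.
files_per_hour = files_created / window_hours
hours_remaining = files_remaining / files_per_hour

print(files_per_hour)
print(round(hours_remaining, 1))
```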

acviana commented 10 years ago

:astonished: :astonished: :astonished:

Yeah ... we might need to look at improving the performance a little. I'll wait for your performance analysis in #86 before making any recommendations.

walyssonBarbosa commented 10 years ago

I saved the organize_filetree.py script in /MTPipeline/scripts/production/. Where should it be located?

acviana commented 10 years ago

Hmm. I'm not sure. Some of the code you created is kind of one-off right? I'm not sure if we'll ever need to use this again. How many lines is the script?

walyssonBarbosa commented 10 years ago

Yes, I think it is.

It has 37 lines.

acviana commented 10 years ago

So just copy the code into here in case we need it again and let's leave it out of the repository. You can close the ticket after you paste the code into the ticket.

walyssonBarbosa commented 10 years ago

Here is the code for organize_filetree.py:

import glob
import os
import sys
import shutil
from astropy.io import fits
from mtpipeline.get_settings import SETTINGS

def organize_filetree(path):
    """
        Organizes the new files, copying them to their new folders.

        Parameters:
            input: path
                The folder in which the files will be organized.

        Returns:
            nothing

        Output:
            nothing
    """
    all_files_list = glob.glob(os.path.join(path, '*.fits'))
    for fits_file in all_files_list:
        # Read the target folder name from the primary header.
        with fits.open(fits_file) as hdulist:
            header = hdulist[0].header
            proposed_folder = '{}_{}'.format(header['PROPOSID'], header['TARGNAME'])
        proposed_path = os.path.join(path, proposed_folder)
        basename = os.path.basename(fits_file)
        if not os.path.exists(proposed_path):
            os.mkdir(proposed_path)
        # Copy only if the file is not already in its target folder.
        if not os.path.exists(os.path.join(proposed_path, basename)):
            shutil.copy2(fits_file, proposed_path)

if __name__ == '__main__':
    organize_filetree(SETTINGS['wfc3_input_path'])
    organize_filetree(SETTINGS['acs_input_path'])

ktfhale commented 10 years ago

It looks like there are a lot of fits images in mtpipeline/archive/acs and mtpipeline/archive/wfc3 that aren't in folders associated with any target, but are just in the main directory for their instrument.

walyssonBarbosa commented 10 years ago

I think it's because we didn't delete the files after we organized them in their respective folders.

walyssonBarbosa commented 10 years ago

For example, ibxa62ojq_flt.fits is in the original folder /astro/mtpipeline/archive/wfc3 and in the correct folder /astro/mtpipeline/archive/wfc3/12801_PLUTO-CHARON-3RD-QUADRANT/.

acviana commented 10 years ago

Ok, in which case we should delete the originals in /astro/mtpipeline/archive/wfc3/ and only keep the copies that are in the subfolders. That should leave only one copy of each file if I am understanding correctly.

walyssonBarbosa commented 10 years ago

I will check if all the files are in their respective folder. First for wfc3 and then for acs.

acviana commented 10 years ago

:+1:

walyssonBarbosa commented 10 years ago

The files in wfc3 are all in their respective folders, so we can delete the original ones now.

However, I still need to organize the files in acs; only some of them are in their subfolders.

walyssonBarbosa commented 10 years ago

All files are now in their respective subfolder.

Should we do something before deleting the original ones?

After doing what we have to do, is this the command to delete the files?

rm -R /astro/mtpipeline/archive/acs/*.fits
rm -R /astro/mtpipeline/archive/wfc3/*.fits

acviana commented 10 years ago

You can step over the files and compare the md5 checksums to ensure that nothing was corrupted.

walyssonBarbosa commented 10 years ago

Just a heads up. I am currently running the following script to get the md5 checksum on the files (acs and wfc3) to see if they were copied correctly:

import glob
import hashlib
import os

def getmd5(filename):
    # Read in binary mode so the hash covers the raw bytes of the file.
    with open(filename, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()

lista = glob.glob('/astro/mtpipeline/archive/acs/*_*/*.fits')
lista += glob.glob('/astro/mtpipeline/archive/acs/*.fits')
#lista = glob.glob('/astro/mtpipeline/archive/wfc3/*_*/*.fits')
#lista += glob.glob('/astro/mtpipeline/archive/wfc3/*.fits')

md5s = {}

# Group the checksums of every copy of a file under its basename.
for f in lista:
    num = getmd5(f)
    basename = os.path.basename(f)
    if basename not in md5s:
        md5s[basename] = []
    md5s[basename].append(num)

Still waiting for them to complete so I can check the values.
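Once the checksums are collected, the check itself could look something like the sketch below. It is only an illustration: md5_of and find_suspects are hypothetical helper names, the demo uses throwaway files in a temporary folder instead of the real archive, and hashing in chunks is an optional variation to avoid reading a large FITS file into memory at once.

```python
import hashlib
import os
import shutil
import tempfile


def md5_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so it is never fully loaded into memory."""
    digest = hashlib.md5()
    with open(path, 'rb') as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def find_suspects(md5s):
    """Return basenames whose copies disagree or that have only one copy."""
    suspects = {}
    for basename, checksums in md5s.items():
        if len(checksums) < 2:
            suspects[basename] = 'only one copy found'
        elif len(set(checksums)) > 1:
            suspects[basename] = 'checksums differ'
    return suspects


# Demo layout: files in a top-level folder plus copies in a subfolder.
top_dir = tempfile.mkdtemp()
sub_dir = os.path.join(top_dir, '00000_DEMO')
os.mkdir(sub_dir)
for name, data in [('good_flt.fits', b'same bytes'),
                   ('bad_flt.fits', b'original bytes')]:
    with open(os.path.join(top_dir, name), 'wb') as handle:
        handle.write(data)
    shutil.copy2(os.path.join(top_dir, name), sub_dir)

# Simulate a corrupted copy by overwriting one file in the subfolder.
with open(os.path.join(sub_dir, 'bad_flt.fits'), 'wb') as handle:
    handle.write(b'corrupted bytes')

# Collect the checksums of every copy under its basename.
md5s = {}
for folder in (top_dir, sub_dir):
    for name in sorted(os.listdir(folder)):
        full_path = os.path.join(folder, name)
        if os.path.isfile(full_path):
            md5s.setdefault(name, []).append(md5_of(full_path))

suspects = find_suspects(md5s)
print(suspects)
```

Any basename reported here would need to be copied again before the originals are deleted.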

acviana commented 10 years ago

:+1:

walyssonBarbosa commented 10 years ago

Only wfc3 had a corrupted fits file (ibtp19i6q_flt.fits). I've removed the corrupted one and copied it again.

Now I think we can delete the original ones. Can I delete them? I don't know if I have permission to do that though.

acviana commented 10 years ago

Good idea to check this, I didn't think anything would have been corrupted but I was wrong. Yes, you can delete the duplicates now. Please close the issue when you're done. If you don't have permission let me know and I can do it.

walyssonBarbosa commented 10 years ago

Ok, I was able to remove the files using rm -Rf /astro/mtpipeline/archive/wfc3/*.fits and rm -Rf /astro/mtpipeline/archive/acs/*.fits.

walyssonBarbosa commented 10 years ago

Closed.

STScI-Citizen-Science / MTPipeline

Data reorganization #88