Closed acviana closed 10 years ago
All the data has been delivered so you can start working on this now.
So this is what I need to do about WFPC2:
/astro/3/mutchler/mt/archive/wfpc2 to /astro/mtpipeline/archive/wfpc2
/astro/3/mutchler/mt/drizzled/ to /astro/mtpipeline/mtpipeline_outputs/wfpc2
/astro/3/mutchler/mt/logs/ to /astro/mtpipeline/logs/
Is that correct?
After this I will need to change some paths in the scripts, right?
Let's cp instead of mv just to be safe :)
And yes, we will have to change some paths. Hopefully those paths are in your settings.yaml file and not hard-coded into the scripts, but if they are, we can replace them with references to the YAML-generated SETTINGS dictionary.
After I copy the files, what should I do about the old ones?
I've already copied the logs folder and changed the settings.yaml file. Still waiting for the system to copy the input and output files.
Let's leave the old ones for now, we can delete them later if we need to.
This is the script I implemented to organize the files:
import glob
import os
import sys
import shutil
from astropy.io import fits
from mtpipeline.get_settings import SETTINGS


def organize_filetree(path):
    """
    Organizes the new files, copying them to their new folders.
    Parameters:
        input: path
            The folder in which the files will be organized.
    Returns:
        nothing
    Output:
        nothing
    """
    all_files_list = glob.glob(os.path.join(path, '*.fits'))
    flt_file_list = [filename for filename
                     in all_files_list
                     if filename.split('_')[-1] == 'flt.fits']
    flt_file_list = set(flt_file_list)
    all_files_list = set(all_files_list)
    filetree_dict = {}
    for flt_file in flt_file_list:
        hdulist = fits.open(flt_file)
        new_folder = str(hdulist[0].header['PROPOSID']) + '_' + hdulist[0].header['TARGNAME'] + '/'
        basename = flt_file.split('/')[-1].split('_')[0]
        if not os.path.exists(os.path.join(path, new_folder)):
            os.mkdir(os.path.join(path, new_folder))
        filetree_dict[basename] = new_folder
    for file in all_files_list:
        basename = file.split('/')[-1].split('_')[0]
        shutil.copy2(file, path + filetree_dict[basename])


if __name__ == '__main__':
    organize_filetree(SETTINGS['wfc3_input_path'])
    organize_filetree(SETTINGS['acs_input_path'])
It seems to be working. The folders are being created and the files will be copied to them next.
I'll just need to set all the group and permissions settings.
On Saturday I passed by ST to check if the process of copying the files had been completed, but it was still running.
Today only the input folder had been copied; the output one is still in progress because the laptop disconnected from the internet yesterday. WFC3 and ACS are almost finished; the script had a few errors, but I think it's correct now.
Great! I'm very impressed with your script. I have a couple of comments but overall, great work:
flt_file_list = set(flt_file_list)
all_files_list = set(all_files_list)
Why do you (1) create sets here and (2) name the variable _list when it's a set and not a list? Sets are great for membership testing (since they are basically a hash table), but you're not doing that here.
hdulist = fits.open(flt_file)
So this will work, but there is a preferred method of dealing with file objects in Python: the with statement. This blog gives a good explanation: http://effbot.org/zone/python-with-statement.htm. So in the future I would write something more like this:
with fits.open(flt_file) as hdulist:
    for hdu in hdulist:
        do_science(hdu)
# Once we leave the with block the file object is automatically closed
If you want to extract the last part of a path, whether it's the top-level folder or the filename, you can use os.path.basename, so I would replace the first part of this line with that.
basename = flt_file.split('/')[-1].split('_')[0]
Also, for future reference in the codebase, refer to u2mi0101t_c0m.fits as the basename and u2mi0101t as the rootname.
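As a minimal sketch of that naming convention (the helper name split_names and the directory in the example path are assumptions for illustration, not part of the pipeline):

```python
import os


def split_names(path):
    """Return (basename, rootname) for a FITS file path."""
    # basename: the filename without its directory, e.g. 'u2mi0101t_c0m.fits'
    basename = os.path.basename(path)
    # rootname: everything before the first underscore, e.g. 'u2mi0101t'
    rootname = basename.split('_')[0]
    return basename, rootname


# The directory used here is only an illustrative assumption.
print(split_names('/astro/mtpipeline/archive/wfpc2/u2mi0101t_c0m.fits'))
```

Using os.path.basename avoids splitting on '/' by hand, which is the change suggested above.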
When you are combining objects into a string and need to convert their types, consider using the format method. Also, if you are going to use os.path.join you don't have to do things like adding trailing slashes, which is one of the advantages of using it in the first place.
new_folder = str(hdulist[0].header['PROPOSID']) + '_' + hdulist[0].header['TARGNAME'] + '/'
So I would rewrite this to be something like this:
new_folder = '{}_{}'.format(hdulist[0].header['PROPOSID'], hdulist[0].header['TARGNAME'])
Here I would add os.path.join and not use file as a variable name, because it is a built-in in Python 2: https://docs.python.org/2/library/functions.html#file. Yes, Python is crazy like that, by which I mean dynamically typed, and it will let you overwrite an existing name, even if it is a built-in part of the language.
shutil.copy2(file, path + filetree_dict[basename])
WFPC2, WFC3 and ACS are organized. However, I didn't delete the old files yet.
I set the WFPC2 outputs to run in the virtual machine. There are still 144 folders to be copied.
The pipeline is running on 05836_saturn. There are still about 400 PNG files missing there.
All files were copied.
Does /astro/3/mutchler/mt/wfpc2/tmp/ contain anything important? I don't think I have permission to access this folder, and I don't know if it was copied correctly.
I updated the group ownership with the following commands:
$ pwd
/astro/mtpipeline/archive
$ find . -name '*.fits' | xargs chgrp -v mtpipe
$ find . -type d | xargs chgrp -v mtpipe
I used the following to update /astro/mtpipeline/mtpipeline_outputs/wfpc2:
$ chgrp -R mtpipe /astro/mtpipeline/mtpipeline_outputs/
I tried to change the file permissions in /astro/mtpipeline/archive but I don't have permission to do that.
These groups are all correct, it's just the wfpc2 folder that needs to be set to 775.
$ ls -l
total 5366
drwxrwxr-x 492 viana STSCI\mtpipe 1829888 Jun 14 21:29 acs
drwxrwxr-x 298 viana STSCI\mtpipe 907264 Jun 13 16:33 wfc3
drwxr-xr-x 182 wbarbosa STSCI\mtpipe 10240 Jun 15 09:21 wfpc2
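For reference, the mode change that wfpc2 needs (the equivalent of chmod 775, i.e. drwxrwxr-x) can be sketched in Python as below. The helper name make_group_writable is an assumption for illustration, and the call only succeeds when run by the directory's owner or by root, which is consistent with the permission problem mentioned above.

```python
import os
import stat


def make_group_writable(directory):
    """Set a directory to mode 775 (rwxrwxr-x) and return the resulting mode."""
    # 0o775: owner and group get read/write/execute, others get read/execute.
    os.chmod(directory, 0o775)
    # Read the mode back so the caller can confirm the change took effect.
    return stat.S_IMODE(os.stat(directory).st_mode)
```

A call such as make_group_writable('/astro/mtpipeline/archive/wfpc2') would apply it to the folder listed above.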
Now it's set to 775:
$ ls -l /astro/mtpipeline/archive/
total 5366
drwxrwxr-x 492 viana STSCI\mtpipe 1829888 Jun 14 21:29 acs
drwxrwxr-x 298 viana STSCI\mtpipe 907264 Jun 13 16:33 wfc3
drwxrwxr-x 182 wbarbosa STSCI\mtpipe 10240 Jun 15 09:21 wfpc2
$ ls -l /astro/mtpipeline/mtpipeline_outputs/
total 22
drwxrwxr-x 191 wbarbosa STSCI\mtpipe 11264 Jun 17 02:23 wfpc2
Great, let me know when all the files are copied over as well.
I think all the files were copied. Now we just need to complete running the pipeline.
Is the pipeline running in the old /astro/3/... path or in /astro/mtpipeline/...? Can you use the log file to give me a rough estimate of how long it should take for the pipeline to run to completion?
It's running in /astro/mtpipeline/...
From 10am to 12pm today, the pipeline created around 3600 files. A rough estimate is that the pipeline will take 130 hours to complete creating all the remaining 172600 files.
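For reference, a straight linear extrapolation from those two numbers can be sketched as follows. It assumes a constant throughput over a two-hour window (10am to 12pm), and the function name is hypothetical. Under that assumption the result is close to 96 hours, so the 130-hour figure presumably allows for slowdowns or other factors not stated here.

```python
def estimate_hours_remaining(files_done, hours_elapsed, files_remaining):
    """Linearly extrapolate the remaining run time from observed throughput."""
    # Observed throughput in files per hour (float() keeps this correct on Python 2).
    rate = files_done / float(hours_elapsed)
    return files_remaining / rate


# 3600 files in 2 hours, 172600 files still to create.
hours = estimate_hours_remaining(3600, 2, 172600)
print('Estimated hours remaining: {:.1f}'.format(hours))
```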
:astonished: :astonished: :astonished:
Yeah ... we might need to look at improving the performance a little. I'll wait for your performance analysis in #86 before making any recommendations.
I saved the organize_filetree.py script in /MTPipeline/scripts/production/. Where should it be located?
Hmm. I'm not sure. Some of the code you created is kind of one-off right? I'm not sure if we'll ever need to use this again. How many lines is the script?
Yes, I think it is.
It has 37 lines.
So just copy the code into here in case we need it again and let's leave it out of the repository. You can close the ticket after you paste the code into the ticket.
Here is the code for organize_filetree.py:
import glob
import os
import sys
import shutil
from astropy.io import fits
from mtpipeline.get_settings import SETTINGS


def organize_filetree(path):
    """
    Organizes the new files, copying them to their new folders.
    Parameters:
        input: path
            The folder in which the files will be organized.
    Returns:
        nothing
    Output:
        nothing
    """
    all_files_list = glob.glob(os.path.join(path, '*.fits'))
    all_files_set = set(all_files_list)
    for fits_file in all_files_set:
        # Read the header keywords that determine the destination folder.
        with fits.open(fits_file) as hdulist:
            proposed_folder = '{}_{}'.format(hdulist[0].header['PROPOSID'], hdulist[0].header['TARGNAME'])
        proposed_path = os.path.join(path, proposed_folder)
        # os.path.basename must be called on the file path.
        basename = os.path.basename(fits_file)
        if not os.path.exists(proposed_path):
            os.mkdir(proposed_path)
            shutil.copy2(fits_file, proposed_path)
        # Copy only if this file is not already in the destination folder.
        elif not os.path.exists(os.path.join(proposed_path, basename)):
            shutil.copy2(fits_file, proposed_path)


if __name__ == '__main__':
    organize_filetree(SETTINGS['wfc3_input_path'])
    organize_filetree(SETTINGS['acs_input_path'])
It looks like there are a lot of FITS images in mtpipeline/archive/acs and mtpipeline/archive/wfc3 that aren't in folders associated with any target, but are just in the main directory for their instrument.
I think it's because we didn't delete the files after we organized them in their respective folders.
For example, ibxa62ojq_flt.fits is in the original folder /astro/mtpipeline/archive/wfc3 and in the correct folder /astro/mtpipeline/archive/wfc3/12801_PLUTO-CHARON-3RD-QUADRANT/.
Ok, in which case we should delete the originals in /astro/mtpipeline/archive/wfc3/ and only keep the copies that are in the subfolders. That should leave only one copy of each file if I am understanding correctly.
I will check if all the files are in their respective folders. First for wfc3 and then for acs.
:+1:
The files in wfc3 are all in their respective folders, so we can delete the original ones now.
However, I still need to organize the files in acs; only some of them are in their subfolders.
All files are now in their respective subfolders.
Should we do something before deleting the original ones?
After doing what we have to do, is this the command to delete the files?
rm -R /astro/mtpipeline/archive/acs/*.fits
rm -R /astro/mtpipeline/archive/wfc3/*.fits
You can step over the files and compare the md5 checksums to ensure that nothing was corrupted.
Just a heads up. I am currently running the following script to get the md5 checksum on the files (acs and wfc3) to see if they were copied correctly:
import glob
import hashlib
import os


def getmd5(filename):
    # Open in binary mode so the checksum covers the raw bytes,
    # and use with so the file is closed afterwards.
    with open(filename, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()


lista = glob.glob('/astro/mtpipeline/archive/acs/*_*/*.fits')
lista += glob.glob('/astro/mtpipeline/archive/acs/*.fits')
#lista = glob.glob('/astro/mtpipeline/archive/wfc3/*_*/*.fits')
#lista += glob.glob('/astro/mtpipeline/archive/wfc3/*.fits')
# Map each basename to the checksums of every copy found.
md5s = {}
for f in lista:
    num = getmd5(f)
    basename = os.path.basename(f)
    if basename not in md5s:
        md5s[basename] = []
    md5s[basename].append(num)
Still waiting for them to complete so I can check the values.
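One way to check the values once the checksums are collected is sketched below. It assumes every file should appear exactly twice (the original in the instrument folder plus one copy in a target subfolder); the function names md5_of_file and find_suspect_files are hypothetical. Reading in chunks is a swapped-in alternative to reading the whole file at once, which keeps memory use low for large FITS files.

```python
import hashlib


def md5_of_file(filename, chunk_size=1024 * 1024):
    """Compute the md5 checksum of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(filename, 'rb') as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def find_suspect_files(md5s):
    """Return entries of md5s (basename -> list of checksums) that look wrong.

    Assumption: each basename should have exactly two identical checksums.
    """
    suspects = {}
    for basename, checksums in md5s.items():
        if len(checksums) != 2 or len(set(checksums)) != 1:
            suspects[basename] = checksums
    return suspects
```

Calling find_suspect_files(md5s) on the dictionary built by the script above would list any file whose copy is missing or differs from the original.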
:+1:
Only wfc3 had a corrupted FITS file (ibtp19i6q_flt.fits). I've removed the corrupted one and copied it again.
Now I think we can delete the original ones. Can I delete them? I don't know if I have permission to do that though.
Good idea to check this, I didn't think anything would have been corrupted but I was wrong. Yes, you can delete the duplicates now. Please close the issue when you're done. If you don't have permission let me know and I can do it.
Ok, I was able to remove the files using rm -Rf /astro/mtpipeline/archive/wfc3/*.fits and rm -Rf /astro/mtpipeline/archive/acs/*.fits.
Closed.
Some data is currently still being transferred so wait till I confirm all the data has arrived before implementing these changes. You can start figuring out the process in the meantime though.
The MAST data archive at STScI has provided us a complete data dump of the ACS and WFC3 moving target data. These data are in /astro/mtpipeline/archive/. Our WFPC2 data are currently in /astro/3/mutchler/mt/archive/. The IT division has provided us with a larger storage area to accommodate all the data products we will be producing under /astro/mtpipeline/.
The goal is to transfer all the data from all 3 instruments into this area, so that it looks like this:
For the WFPC2 instrument this involves:
- mv to transfer the archive/ folder
- mv to transfer the drizzled/ folder to a mtpipeline_outputs/ folder
- mv to transfer the logs/ folder

For the ACS and WFC3 instruments this involves:
- Reading the proposid and targname keywords
- Creating a folder named <proposal_id>_<target_name>/ if it doesn't already exist