e-merlin / eMERLIN_CASA_pipeline

This is the CASA eMERLIN pipeline to calibrate data from the e-MERLIN array. Please fork the repository before making any changes and read the Coding Practices page in the wiki. Please report issues with the pipeline in the issues tab.

Added headless version + build instructions #21

Status: Closed (jradcliffe5 closed this pull request 7 years ago)

jradcliffe5 commented 7 years ago

Biggish update.

jradcliffe5 commented 7 years ago

Oh, almost forgot: CASA does not expose the task importfitsidi under the casa module (all the other tasks are there), so to call CASA from the command line with the pipeline as-is, the actual task has to be included in the install.
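A minimal sketch of the fallback this implies (the module layout is assumed from a CASA 4/5-style install and may differ between versions; the second import refers to the copy of the task bundled with the pipeline install, not to CASA's own API):

```python
# Sketch only: make importfitsidi available alongside the other tasks
# when running CASA headless from the command line.
try:
    # Per the comment above, the other tasks import fine this way...
    from casa import importfitsidi
except ImportError:
    # ...but importfitsidi is missing, so fall back to the copy of the
    # task wrapper shipped with the pipeline (assumed to be on sys.path).
    from importfitsidi import importfitsidi
```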

jmoldon commented 7 years ago

Thanks, looks good! I don't remember having problems with importfitsidi directly from a normal CASA script. I will check it today and tomorrow.

jmoldon commented 7 years ago

Everything looks fine. I checked the import, hanning, and ms2mms; ms2mms also works if hanning is not done. The only bug I saw is that the output of importfitsidi does not go to the 'data' directory, but to the current directory. It can be fixed by changing, in eMERLIN_CASA_pipeline.py:

```diff
25c25
< vis = inputs['inbase']+'.ms'
---
> vis = data_dir+inputs['inbase']+'.ms'
```

I will try to correct that with a commit within this pull request. Hope it works!

jradcliffe5 commented 7 years ago

Cheers! Yep, it's meant to do that, so that the data being created and calibrated sits in the cwd. Otherwise we would have to change all the locations of the data if the pipeline is run all at once. There is a check in the code to see if the data is in the current working directory.

jmoldon commented 7 years ago

In general, but especially with large datasets like these, I think it is a good idea to keep all the data files as isolated as possible, which is very easy with a global data path defined. Having all the big files in one folder gives more flexibility: to move, keep, or delete any large dataset easily, to exclude the folder from rsync, to work with data on external or remote disks, etc.

jradcliffe5 commented 7 years ago

Even though the original method I wrote takes up more disk space, I think we should keep the processing within the current directory from which the pipeline is executed. This means that we can shift the data onto an SSD or striped drive for quick I/O and then shift it back to data_dir after pipelining.

The problem with processing on remote disks is that it increases processing time, as the data has to be passed across the network in order to be loaded into memory. Also, some remote disks have slow I/O speeds; RAID arrays, for example, effectively have to write the data three times for the parity reconstruction, so processing directly on them is costly.

The original method first checks whether the data is in the current working directory, i.e. whether inbase + '.ms'/'.mms' is there. If not, it checks the same within data_dir, and if found there, the MS or MMS is rsync'ed to the current directory; this is a failsafe for when the data has already been partially processed. Otherwise, the FITS-IDI files (if importfitsidi = 1) are concatenated from data_dir. The lookup order is sketched below.
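A minimal sketch of that lookup order (the function name and structure here are illustrative, not the pipeline's actual code; 'inbase' and 'data_dir' mirror the inputs file):

```python
import os

# Sketch of the data-location failsafe described above.
def find_or_fetch_ms(inbase, data_dir):
    for ext in ('.mms', '.ms'):
        local = inbase + ext
        if os.path.exists(local):
            # 1. Data already in the current working directory.
            return local
        stored = os.path.join(data_dir, local)
        if os.path.exists(stored):
            # 2. Data in data_dir: rsync it into the cwd and use that copy.
            os.system('rsync -a {0} .'.format(stored))
            return local
    # 3. Nothing found: the caller falls back to concatenating the
    #    FITS-IDI files from data_dir with importfitsidi.
    return None
```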

jmoldon commented 7 years ago

OK, after testing different possibilities I now see what default behaviour you expected and what happens in the different situations. For now I'm OK with having all the ms/mms files in the current directory, so if you want I can revert the commit I made when I thought it was a typo/bug, sorry.

But I think we still have to rethink the directory structure a bit, for two reasons:

(1) The directory structure should be flexible (able to use places other than the cwd). I think we need to allow the user to choose these paths (a sketch follows below):

-- "raw_data": where the fits/FITS files are. Default: './raw_data'
-- "data_dir": where all the processed ms/mms files will be. Default: './data'
-- "plot_dir": for the plots. Default: './plots'
-- "calib_dir": where all the calibration tables are. Default: './calibration'
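For illustration, those choices could sit in the inputs file as something like this (the key names come from the list above; the exact format is hypothetical):

```python
# Hypothetical inputs entries for the four configurable paths in (1).
inputs = {
    'raw_data' : './raw_data',     # incoming fits/FITS files
    'data_dir' : './data',         # processed ms/mms files
    'plot_dir' : './plots',        # diagnostic plots
    'calib_dir': './calibration',  # calibration tables
}
```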

(2) Before the main imports you run sys.path.insert(0,'./CASA_eMERLIN_pipeline'), which assumes that the pipeline can only be started from one folder above where the repository was cloned. I imagine that in the long term we will have a stable version locally (maybe with some personal changes for a particular system), and we will want to call it from anywhere without needing to clone it from git every time.

What do you think? For the moment I'm happy to limit the number of choices and just work in the cwd, but the structure in (1) is very easy to implement. For (2) we have to think about what the best option is. In the EVLA pipeline (at least the last time I used it) you have to edit the pipeline to explicitly say where it is installed, which I don't particularly like, but it is a possibility. We should open an issue to discuss that particular point.

jradcliffe5 commented 7 years ago

I agree, (1) is the way forward, as it means the pipeline is flexible and can be called from anywhere.

To fix (2) and tell CASA where the pipeline is on the system, we could either add a build instruction to create a soft link to the directory, or add a pipeline_dir variable and then specify this (not sure how the GUI will work with this). A sketch of the latter option is below.
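Something like this, assuming the variable is set near the top of the run script (the default path is hypothetical):

```python
import sys

# pipeline_dir points at the local clone/install of the pipeline; set it
# once per system instead of assuming it sits under the cwd.
pipeline_dir = '/path/to/eMERLIN_CASA_pipeline'
sys.path.insert(0, pipeline_dir)
```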

So I would say, for now: revert the commit. I will add the extra directories in (1) and try to make the pipeline completely independent sometime towards the end of the week.
