Added new class in cax/tasks/filesystem.py to add the raw size in byte

lucrlom commented 7 years ago

Fixes #86:

I added a new task called AddSize that evaluates the size of the raw data and checks their completeness. I tested the script only on midway, but I think it should run on xe1t-datamanager by default, before the copy with Rucio starts. I didn't restrict it to a particular host, because in general it should be able to verify the raw data at each site and then evaluate the size (a kind of double check when there are multiple copies of the raw data). To use it, it is enough to add "AddSize" to the task list.
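For illustration only, a minimal sketch of how the size of a raw data directory could be evaluated in bytes. The function name `raw_size_in_bytes` is hypothetical and this is not the code of the AddSize task itself; it only assumes that a raw dataset is a directory of files on the local filesystem.

```python
import os


def raw_size_in_bytes(directory):
    """Sum the sizes of all regular files below `directory`, in bytes.

    Hypothetical helper: the real AddSize task may compute this differently.
    """
    total = 0
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            # Skip symbolic links so that linked files are not counted.
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total
```

The result is a plain integer number of bytes, which avoids any ambiguity about units when it is stored in the database.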

lucrlom commented 7 years ago

I fixed some variables that were used before being defined.

pdeperio commented 7 years ago

Check the Codacy report for error descriptions. Just some style issues.

pdeperio commented 7 years ago

Should it just go into a single field, e.g. "data.raw_size" (i.e. not attached to any particular site)?

Didn't someone (@XeBoris, @coderdj) already create a field for this manually for some runs?

lucrlom commented 7 years ago

@XeBoris Basically this operation should be done for every repository where cax is running. Maybe, as suggested by @pdeperio, it is better to put it in a field separate from the hosts. Let me check. @XeBoris Interesting difference between GB and GiB, I really didn't know that. I will correct it. Thanks.
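As a reminder of the GB/GiB difference mentioned here, a short sketch (the function names are hypothetical): GB is a decimal unit (10^9 bytes), while GiB is a binary unit (2^30 bytes).

```python
def bytes_to_gb(n_bytes):
    """Convert bytes to gigabytes (decimal unit, 1 GB = 1000**3 bytes)."""
    return n_bytes / 1000**3


def bytes_to_gib(n_bytes):
    """Convert bytes to gibibytes (binary unit, 1 GiB = 1024**3 bytes)."""
    return n_bytes / 1024**3
```

Storing the raw byte count and converting only for display keeps the database value independent of this choice.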

XeBoris commented 7 years ago

Right now each raw data location gets its own size entry (see e.g. run 8999). My script adds it for the rucio-catalogue and tsm-server locations. But this is just a one-time action. @coderdj already started to add fields for the size.

pdeperio commented 7 years ago

did you ask @coderdj for his code snippet/MongoDB call for adding this? did he do it per run or per data location (i can't find example run he did in #86)?

XeBoris commented 7 years ago

@pdeperio I had a chat with Francesco today after the telecon. We will add a "raw_data_size" field, which is then common to all data entries regarding the raw data, and each processed file gets its own "size" field.

lucrlom commented 7 years ago

@pdeperio @XeBoris @coderdj It took a while, but I added a new field in the run_doc called raw_size_byte. This is the only place I could put it so that it is common to all sites. I had the very bad idea of putting it on its own in the data field, but I realized that cax raises an error when it finds an entry in data that lacks type, host, etc. OK. I added a check so that the size is evaluated only if raw_size_byte is not already present in the runDB. You can check run 170324_0256. The next step is to add, to each processed entry, the size of the processed file where present. TO BE DONE.

Another thing: I checked that the size reported by du -h corresponds to the byte count divided by 1024 at each step (size/1024/1024 for MB, one more /1024 for GB); in this case 395 MB.
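The "evaluate only if not already present" check described above could look roughly like this. It is a sketch under assumptions: `raw_size_update` and the `compute_size` callable are hypothetical names, and the returned dictionary is simply a MongoDB-style `$set` document that the caller would pass to the database.

```python
def raw_size_update(run_doc, compute_size):
    """Return a `$set` update for raw_size_byte, or None if nothing is to be done.

    `run_doc` is the run document as a dict; `compute_size` is any callable
    returning the raw data size in bytes (hypothetical, supplied by the caller).
    """
    if 'raw_size_byte' in run_doc:
        # Already stored in the runDB: do not recompute.
        return None
    return {'$set': {'raw_size_byte': int(compute_size())}}
```

Passing the size computation as a callable means the (possibly slow) directory scan only runs when the field is actually missing.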

lucrlom commented 7 years ago

Added the size_byte field also to the entries of processed runs in the data array. Now I'm going to test on datamanager.

lucrlom commented 7 years ago

Dear all, on Thursday I tested and updated the implementation of the file size for the raw and processed data in the runDB. There are still a couple of things to be fixed:

First (main priority), I have to recover the data info for run 10900, only for the midway-login1 entry of the processed files (pax_v6.6.5).

I made an error using the $pull operator to remove the wrong name for the size entry (size_entry) and set the new one (@Dan can you fix it?)

I have to change the same entry in run 170324_0256 (muon veto), but I didn't, to avoid making another mess. The command I used in the update function was: {$pull: { 'data': {'size_byte': number } } }, but it deleted the whole entry. I also tried {$pull: { 'data.$.size_byte': number } }, but without any change in the db.
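For context: `$pull` removes every array element that matches its condition, which is why the first command deleted the whole `data` entry rather than one key. A possible alternative, sketched below, is to `$unset` the old key and `$set` the new one on the element selected by the positional `$` operator. The helper name and the query keys `number`, `host` and `type` are assumptions about the run document layout, not taken from the cax code.

```python
def rename_entry_field(run_number, host, old_name, new_name, value):
    """Build a (query, update) pair that renames one key inside a 'data' entry.

    Hypothetical helper; the query keys are assumptions about the run document.
    """
    query = {
        'number': run_number,
        # $elemMatch selects the array element that the positional `$` refers to.
        'data': {'$elemMatch': {'host': host,
                                'type': 'processed',
                                old_name: {'$exists': True}}},
    }
    update = {
        '$unset': {'data.$.' + old_name: ''},   # drop only the old key
        '$set': {'data.$.' + new_name: value},  # write the value under the new key
    }
    return query, update
```

With pymongo, the pair would be applied as `collection.update_one(query, update)`; trying it first on a copy of the collection seems prudent.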

I tried to use Dan's script to restore from the backup, but I had some problems opening two connections to the database on two different ports (27017, 27018).

The second thing is which standard format we want to use for the size info in the database. In this latest version there is a new field for the raw size only, called raw_size_byte (I think it is a good name because it explains what it is), and for each processed file there is a size field in the corresponding entry of the data array.

The same field is present in the Rucio catalogue entry for the raw data (that is why I changed my previous name), but it is not present in the entries for xe1t-datamanager, midway-login1 or login (OSG). In general this is not a good thing, because it can generate errors when you try to read a field that is present in some entries and missing in others within the same array. How do we want to set this info here?
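One way to read a key that exists only in some entries of the `data` array without raising an error is `dict.get`, which returns `None` when the key is absent. A minimal sketch (the function name is hypothetical, and `host`, `type` and `size` are assumed to be the entry keys):

```python
def list_entry_sizes(run_doc):
    """Return (host, type, size) for every entry in run_doc['data'].

    `size` is None for entries that do not carry the field.
    """
    return [(entry.get('host'), entry.get('type'), entry.get('size'))
            for entry in run_doc.get('data', [])]
```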

Ciao Francesco

lucrlom commented 7 years ago

Guys, can I merge this code? It would be better if you checked it again first.

pdeperio commented 7 years ago

Did you address all the issues in your last comment?

Also, can you point us to some runs that you have tested on already and show the new fields in the RunsDB?

XeBoris commented 7 years ago

@lucrlom Once your branch is merged, all "new" raw datasets will automatically get the raw data size information in the runDB based on your definition. I will then adapt my old script to your organization of the raw data size information and use it to add this information to the runDB. That way I need to run the script only once. I think our efforts combine quite well here. Please also show me one example of the new fields in the runDB. Thank you.

lucrlom commented 7 years ago

Sorry guys, I missed your comments. Examples are in runs 10900, 10901 and 10902. @XeBoris I see that in the latest runs the size is no longer present in the rucio entries. Why? Did you remove it?

I'll try to perform a final test also on xe1t-datamanager to write the size of the raw data.

XeBoris commented 7 years ago

@lucrlom I don't see raw_size_byte for runs 10901 and 10902 in the runDB. It would be good to have a final test then. This should also update the processed file size (e.g. for pax6.8.0).

lucrlom commented 7 years ago

@XeBoris raw_size_byte is not present because I tested on midway, where there are no raw data. I'd like to test on xe1t-datamanager as well, to write the raw_size_byte variable.

lucrlom commented 7 years ago

@pdeperio @XeBoris I just realized that we changed the number of events in each .zip file of the raw data: there are now only 100 instead of 1000. This means that I have to change the check on the size of the raw data.

lucrlom commented 7 years ago

I did the last test of the AddSize class and I changed the integrity check of the raw data. I think it is good to perform this check. As suggested by Patrick, I'm going to read the variable run_doc['reader']['ini']['trigger_config_override']['Zip']['events_per_file']

This variable is present only in the latest runs, while in the old runs with 1000 events per zip file the field 'Zip' doesn't exist, and for the LED data 'trigger_config_override' is not present either. This could cause problems in general, but I added a few checks to be sure to read the right variables in the right position. Of course, if the 'Zip' variable is not present I will assume 1000 events. I checked the latest LED runs and they still have 1000 events per zip file.
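The lookup with a fallback to 1000 events could be sketched as follows. The function name `events_per_zip` is hypothetical and the actual checks in the AddSize class may be written differently; only the key path and the default of 1000 come from the description above.

```python
def events_per_zip(run_doc, default=1000):
    """Read events_per_file from the run document, falling back to `default`.

    Walks reader -> ini -> trigger_config_override -> Zip -> events_per_file
    and returns `default` as soon as any level is missing.
    """
    keys = ('reader', 'ini', 'trigger_config_override', 'Zip', 'events_per_file')
    node = run_doc
    for key in keys:
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node
```

Walking the path key by key covers both cases mentioned above: old runs without 'Zip' and LED runs without 'trigger_config_override'.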

I performed the tests on 'datamanager', 'midway' and 'login', and it seems to work properly on all of the hosts. The runs I used are 13555, 13400, 13401, 10900, 10901, 10902 and 4011:

-- run 13555 on datamanager and midway (raw_size_byte and, in 'data', the size for each processed file)
-- run 13400 on datamanager, midway and login (raw_size_byte and, in 'data', the size for each processed file)
-- run 13401 on midway and login (raw_size_byte and, in 'data', the size for each processed file)
-- run 10900 on datamanager and midway (raw_size_byte and, in 'data', the size for each processed file)
-- run 10901 on midway (in 'data', the size for each processed file)
-- run 10902 on midway (in 'data', the size for each processed file)
-- run 4011 on login (in 'data', the size for each processed file)

pdeperio commented 7 years ago

Thanks @lucrlom. Can you please clarify the following?

13400: missing size for TSM, Rucio entries (which exist for 10900), expected since you have raw_size_byte.
13401: missing raw_size_byte and size for Midway processed file, both you state should exist.
10901, 10902, 4011: missing raw_size_byte, even though size for TSM, Rucio entries exist. Could it just be copied?

lucrlom commented 7 years ago

@pdeperio The class can add the size only if the data are present on the host where it is running. On datamanager we have only the raw data, therefore it can add only this information to the database ('raw_size_byte'). I don't know why the size is no longer present for TSM and the Rucio catalogue in the latest runs. Maybe @XeBoris can tell us why. I can copy the size if the raw data are present only in the Rucio catalogue, but otherwise, if the 'size' variable is not present, I would have to download the raw data and then calculate it on some host (midway seems to be the best place).

XeBoris commented 7 years ago

@lucrlom @pdeperio The reason why rucio and tsm do not have the size anymore is the following: when Francesco and I started to work on the raw/processed file sizes, we came to the conclusion that we needed to estimate the raw data size from the rucio catalogue, because we don't have the old raw data sets stored somewhere. Therefore I created a script which calculates the raw data size in bytes from the rucio catalogue and puts this information into the data/rucio/size and data/tsm/size fields. This script runs for quite a long time and was created for one-time use only. Now that Francesco has almost finalized his branch with the raw size information, which is stored in 'raw_size_byte', it is no longer necessary to put the raw data size into tsm and rucio too.

Therefore, once this pull request is merged I will use my old script to add the raw data sizes (based on the rucio entries) to 'raw_size_byte' for the remaining runs. Furthermore, I will remove the size information from the data/rucio and data/tsm entries again, to avoid having the same information twice in the database for the same run. @pdeperio: you suggested copying this information. This can be done, but since it is necessary to calculate more raw data sizes anyway, I would just re-calculate it and remove the old runDB entry.

lucrlom commented 7 years ago

@XeBoris Ciao Boris, thanks for your reply. For the old raw data I can check whether they have the size entry in the Rucio field or in the TSM field and use it as the value for raw_size_byte; if not, either calculate it (if the raw data are present locally) or leave it empty. Otherwise I would have to download the data and calculate the size (as a last resort).
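The first step of that fallback, taking the size from an existing Rucio or TSM entry, might be sketched like this. The function name is hypothetical, and the entry keys (`type`, `host`, `size`) and host names are assumptions based on the names used earlier in this thread.

```python
def raw_size_from_entries(run_doc, hosts=('rucio-catalogue', 'tsm-server')):
    """Return the 'size' of the first raw entry on one of `hosts`, else None."""
    for entry in run_doc.get('data', []):
        if entry.get('type') == 'raw' and entry.get('host') in hosts:
            size = entry.get('size')
            if size is not None:
                return int(size)
    # No usable entry: the caller decides whether to compute locally or leave it empty.
    return None
```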

pdeperio commented 7 years ago

Great! I think this is good to go then. @XeBoris please go ahead and merge.

XeBoris commented 7 years ago

@lucrlom I need to revisit my old script and make some changes to calculate the raw data size in bytes from the rucio catalogue, but I cannot recommend downloading data just to calculate the size. This would put a heavy and unnecessary load on the system. You could set raw_size_byte from the TSM or rucio information, but I think this is not necessary either, because you would then have a few lines of code that do something very specific for some raw data and that need to run only once. Therefore I suggest leaving this work to my script, which is then executed only once.

lucrlom commented 7 years ago

@XeBoris ok, sounds good

XENON1T / cax

Added new class in cax/tasks/filesystem.py to add the raw size in byte #113