As a developer, I don't want posted data to be removed after evaluation completes if that evaluation failed - Githubissues

epag commented 2 months ago

Author Name: Hank (Hank) Original Redmine Issue: 108899, https://vlab.noaa.gov/redmine/issues/108899 Original Date: 2022-10-05 Original Assignee: Arvin

I thought I already had a ticket for this, but I can't find it. This may be dangerous if COWRES becomes popular and lots of users use direct posting. Saving the data allows for reproducing errors, but most errors are likely declaration errors, not data errors, so we may end up saving quite a few files for no reason.

I guess I'm on the fence about this and willing to reject if we decide its too risky.

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2022-10-05T14:53:22Z

Related to a development ticket, #108884, that involves posted data that is not quite right. It and the connected #103897 are why I wrote this ticket.

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2022-10-05T16:47:56Z

The file deletion occurs at the end of the @call@ method in @JobResults.java@:

        public Integer call() throws IOException, TimeoutException
        {
            LOGGER.debug( "call called on {}", this );
            String jobId = this.getJobId();

... SNIP ...

            JobMetadata sharedData = this.getJobMetadata();
            LOGGER.debug( "Shared metadata before setting exit code to {}: {}",
                          resultValue, sharedData );
            sharedData.setExitCode( resultValue );
            LOGGER.debug( "Shared metadata after setting exit code to {}: {}",
                          resultValue, sharedData );

            if ( resultValue == 0 )
            {
                sharedData.setJobState( JobMetadata.JobState.COMPLETED_REPORTED_SUCCESS );
            }
            else
            {
                sharedData.setJobState( JobMetadata.JobState.COMPLETED_REPORTED_FAILURE );
            }

            LOGGER.debug( "Shared metadata after setting job state: {}",
                          jobMetadata );
            JobResults.deleteInputs( jobMetadata );
            return resultValue;
        }
</code>

Should be easy enough to move the deletion into the preceding check of @resultValue@. Setting the estimated time at 4 hours, primarily because of testing, but it remains to be decided if saving the inputs for a failed evaluation is worth the risk of running out of disk space.

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2022-10-05T16:52:13Z

As I alluded to in #108884-9, I would probably start with extra logging before we do this. The logging would include information such as the number of bytes written and the checksum of the bytes written. This is way cheaper than retaining the output on the file system and has no risk of causing a disk overflow, which I agree is a significant risk, especially for longer deployment cycles.

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-05-31T19:28:31Z

Pointed Arvin to this ticket and specifically the previous comment from James.

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-06-01T18:45:37Z

Per the conversation in the meeting, we are planning to not remove data when an evaluation that used the posted data fails. We may need to accompany that with (1) a request for more disk space (probably best coming from me when it is needed); and (2) a tool that removes least recently used data if the disk fills up past some threshold (this is optional, but may be warranted).

Arvin, James: If that summary appears incorrect, let me know. Thanks!

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-01T18:49:01Z

            if ( resultValue == 0 )
            {
                sharedData.setJobState( JobMetadata.JobState.COMPLETED_REPORTED_SUCCESS );
                JobResults.deleteInputs( jobMetadata );
            }
            else
            {
                sharedData.setJobState( JobMetadata.JobState.COMPLETED_REPORTED_FAILURE );
            }
</code>

Changed the code so that it deletes the files ONLY if the the @resultValue@ is 0 (In other words meaning success)

epag commented 2 months ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-02T17:26:47Z

Did some research on the @Java.nio.file@ and it seems that this new version of the @Java.io@ works on mounted drives as well!

epag commented 2 months ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-02T19:11:44Z

            if ( resultValue == 0 )
            {
                sharedData.setJobState( JobMetadata.JobState.COMPLETED_REPORTED_SUCCESS );
                JobResults.deleteInputs( jobMetadata );
            }
            else
            {
                sharedData.setJobState( JobMetadata.JobState.COMPLETED_REPORTED_FAILURE );
                if(JobResults.decideToDelete() >= 90.0){
                    LOGGER.debug("Disk space has reached threshold... deleting input");
                    JobResults.deleteInputs( jobMetadata );
                }
            }
</code>

    private static double decideToDelete(){
        Path path = FileSystems.getDefault().getPath("/");

        double percentageUsed = 0.0;

        try{
            FileStore fileStore = Files.getFileStore(path);

            long totalSpace = fileStore.getTotalSpace();
            long usableSpace = fileStore.getUsableSpace();
            long usedSpace = totalSpace - usableSpace;

            percentageUsed = (((double) usedSpace / totalSpace) * 100);
        } catch (IOException | NumberFormatException e){
            e.printStackTrace();
        }
        return percentageUsed;
    }
</code>

Code/logic added to calculate the size of the disk being used and whether or not we can keep the file. The threshold as of right now is set to 90% but that can be changed as time goes on. Right now if the threshold of 90% is met then the files that were used for the calculation will delete. We can change this as we find necessary such as deleting the oldest modified, biggest file size, etc. Any feedback on this? Also, for testing purposes I need to know the directory of the mounted disk, I have it set to the root/default right now, so this may or may not work (needs testing).

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-06-02T19:38:26Z

Hey Arvin,

Let's discuss more next week. I think Evan had a good suggestion about being selective in what we delete by using something akin to an LRU cache, deleting the oldest stuff, up to a target threshold or, minimally, the amount necessary so that the space falls under the high threshold. It won't be exactly an LRU cache because we are talking about some file metadata as seen by the OS (i.e., when they were created or last accessed). I'm not sure if there will be a robust mechanism to get this information, but it's worth discussing further.

Beyond that, please use the code template for IDEA before you push any code (@devUtils/wres_idea/format_settings.xml@).

Also, please use a logger in all circumstances, rather than printing information to standard out or standard error.

As an aside, you'll probably want to install SonarLint or similar in IDEA so that you can follow the tips/hints for code smells before you push - I find it super useful in catching simple problems.

The final thing to think about is whether and how we can unit test this.

Thanks!

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-06-02T19:39:28Z

And have a great weekend!

epag commented 2 months ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-02T20:04:27Z

Sounds good! I have no pushed any code yet for this exact reason of not knowing the standard but I think I got it now. In terms of Unit Testing, the @decideToDelete@ method is fairly straight forward to test. I did also speak to Evan and we spoke more about his idea and will be adding a feature to sort and delete the latest modified files either until 14 days or the adequate space required up to 14 days. We will speak more next week!

Thank you,

have a great weekend!

epag commented 2 months ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-05T17:11:26Z

Is there a possibility to get a little more insight on the directory structure of where the input data from the user's are stored?

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-06-05T17:28:13Z

Yes, it's fairly straightforward.

The docker compose script (@compose-entry.yml@) defines the java temp dir as follows:

-Djava.io.tmpdir=/mnt/wres_share/input_data

The tasker then creates a temp file in the default temp dir (as above), using assigned posix file permissions and a file name that includes the evaluation job identifier, like so (from @WresJobInput.java@):

        // Indirect way of detecting "is this unix or not"
        if ( System.getProperty( "file.separator" )
                   .equals( "/" ) )
        {
            LOGGER.debug( "Detected unix system." );
            permissions = EnumSet.of(
                    PosixFilePermission.OWNER_READ,
                    PosixFilePermission.OWNER_WRITE,
                    PosixFilePermission.OWNER_EXECUTE,
                    PosixFilePermission.GROUP_READ,
                    PosixFilePermission.GROUP_WRITE,
                    PosixFilePermission.GROUP_EXECUTE );
            posixAttributes = PosixFilePermissions.asFileAttribute( permissions );
        }
        else
        {
            LOGGER.debug( "Detected windows system." );
            permissions = null;
            posixAttributes = null;
        }

        java.nio.file.Path temp = null;
        String md5 = "not_computed";

        try
        {
            // Content type detection is done in core WRES, not based on name.
            if ( Objects.nonNull( posixAttributes ) )
            {
                temp = Files.createTempFile( jobId + "_", "", posixAttributes );
            }
            else
            {
                temp = Files.createTempFile( jobId + "_", "" );
            }
</code>

It then copies the posted data (input stream) to the temp file.

In other words, you will see a directory like this on the -host machine- mounted share space:

/mnt/wres_share/input_data

That contains a bunch of files with the job id prefix and some random other identifier chosen by @Files::createTempFile@, and the associated posix permissions. These are the user-supplied inputs.

Then the worker-shim has read-only access to this dir to read the data. This is defined in the @compose-workers.yml@ for the @worker@ service:

         # To read user input data written by tasker:
         - /mnt/wres_share/input_data:/mnt/wres_share/input_data:ro
</code>

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-06-05T17:30:50Z

In short, there isn't really a directory structure per se, merely a single directory that contains the uniquely identified files associated with all evaluations that have posted data. Each file is prefixed with the evaluation job id, and then some random other id chosen by the jdk (whose path is then translated into a uri and added to the evaluation declaration).

epag commented 2 months ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-05T17:42:50Z

Thank you James, do I have access to the server to be able to login and see this? If so what is the host name?

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-06-05T17:46:02Z

I cannot really speak to nwcal machines or access (I don't have access), but Evan can probably give you an idea, pending Hank's return tomorrow. If you can access our staging and production environments (you should be able to), then you should also be able to see the @mnt/wres_share@ from one of those staging or production machines.

epag commented 2 months ago

Original Redmine Comment Author Name: Evan (Evan) Original Date: 2023-06-05T18:14:46Z

You should have access to them as part of the onboarding process.

Dev: nwcal-ised-dev1 Testing: nwcal-wres-ti01

https://vlab.noaa.gov/redmine/projects/wres/wiki/Getting_Started_with_WRES_Development#A-tour-of-the-nwcal-wres-ti01 A guide on the structure

epag commented 2 months ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-05T18:31:02Z

Thank you! I logged into both: @nwcal-ised-dev1@ @nwcal-wres-ti01@ but both did not have the @/mnt/wres_share/input_data@ directory. I'm assuming this may only be in the production servers?

epag commented 2 months ago

Original Redmine Comment Author Name: Evan (Evan) Original Date: 2023-06-05T19:13:06Z

Arvin wrote:

Thank you! I logged into both: @nwcal-ised-dev1@ @nwcal-wres-ti01@ but both did not have the @/mnt/wres_share/input_data@ directory. I'm assuming this may only be in the production servers?

Checked and it is available on prod. You should have access there too nwcal-wres-prod02

epag commented 2 months ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-05T19:37:10Z

Just checked and I do have access and found the directory. Thank you, Evan!! :)

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-06-06T11:05:35Z

You should see @/mnt/wres_share/input_data@ on each of the following machines which are used to host the COWRES in our three environments, development, staging, and production:

@nwcal-wres-dev02/03@ @nwcal-wres-ti02/03@ @nwcal-wres-prod02/03@

Within each environment, the disk @/mnt/wres_share@ is cross-mounted on the two machines, so its the same @/mnt/wres_share/input_data@ on both machines in development, in staging, and in production.

To clarify for everyone, we use the following machines:

@nwcal-ised-dev1@ - ISED-wide development machine. I use this for development purposes, as well for running some convenience scripts to do quick, cursory checks of the services.
@nwcal-wres-ti01@ - WRES system testing machine. Its also used for other aspects of testing, including having tools installed to help with checking outputs.
@nwcal-wres-dev03@ - Entry machine for the -dev (development) COWRES instance. Yes, the 03 machine is the entry machine in development; not sure why.
@nwcal-wres-dev02@ - Workers-only machine for the -dev COWRES instance.
@nwcal-wres-ti02@ - Entry machine for the -ti (staging) COWRES instance.
@nwcal-wres-ti03@ - Workers-only machine for the -ti COWRES instance.
@nwcal-wres-prod02@ - Entry machine for the -prod (production) COWRES instance.
@nwcal-wres-prod03@ - Workers-only machine for the -prod COWRES instance.
@nwcal-wresdb-*@ - Postgres database servers, one in development (-dev01), two each in staging and production (-ti01/02, -prod01/02).
@nwcal-wresgui-*@ - Machines hosting the WRES GUI, one in each environment (-dev01, -ti01, -prod01).

Thanks,

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-06T16:23:41Z

Added an instance variable to reference this directory in multiple parts of the methods.

private static final String InputDataDirectory = "/mnt/wres_share/input_data";
</code>

Logic was added in the case of a failure and threshold of 90% is met, It will create a list of paths in the @input_wres@ directory, sort them by least recently used and continue to delete the least modified file till we have enough space under the threshold to store the failure inputs (essentially creating a form of an LRU).

            if ( resultValue == 0 )
            {
                sharedData.setJobState( JobMetadata.JobState.COMPLETED_REPORTED_SUCCESS );
                JobResults.deleteInputs( jobMetadata );
            }
            else
            {
                sharedData.setJobState( JobMetadata.JobState.COMPLETED_REPORTED_FAILURE );
                if(JobResults.decideToDelete() >= 90.0){
                    LOGGER.warn("Disk space has reached threshold... deleting input");
                    List<Path> files = JobResults.getFilesInDirectory(InputDataDirectory);
                    JobResults.sortFilesByLastModified(files);
                    long totalDriveSpace = JobResults.getTotalDriveSpace();
                    double ninetyPercentOfDriveSpace = totalDriveSpace * 0.90;
                    while(totalDriveSpace >= ninetyPercentOfDriveSpace && !files.isEmpty()){
                        Path fileToDelete = files.remove(0);
                        JobResults.deleteFile(fileToDelete);
                        totalDriveSpace = JobResults.getTotalDriveSpace();
                    }
</code>

    private static double decideToDelete(){
        Path path = FileSystems.getDefault().getPath(InputDataDirectory);

        double percentageUsed = 0.0;

        try{
            FileStore fileStore = Files.getFileStore(path);

            long totalSpace = fileStore.getTotalSpace();
            long usableSpace = fileStore.getUsableSpace();
            long usedSpace = totalSpace - usableSpace;

            percentageUsed = (((double) usedSpace / totalSpace) * 100);
        } catch (IOException | NumberFormatException e){
            e.printStackTrace();
        }
        return percentageUsed;
    }
    private static long getTotalDriveSpace() throws IOException {
        FileStore store = Files.getFileStore(Paths.get(InputDataDirectory));
        return store.getTotalSpace();
    }

    private static List<Path> getFilesInDirectory(String path) throws IOException {
        try(Stream<Path> paths = Files.list(Paths.get(path))){
            return paths.collect(Collectors.toList());
        }
    }
    private static void sortFilesByLastModified(List<Path> files){
        Collections.sort(files, Comparator.comparingLong(path -> {
            try {
                return getFileLastModified(path);
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }));
    }

    private static long getFileLastModified(Path file) throws IOException {
        BasicFileAttributes attributes = Files.readAttributes(file, BasicFileAttributes.class);
        return attributes.lastModifiedTime().toMillis();
    }
    private static void deleteFile(Path file) throws IOException {
        Files.delete(file);
    }
</code>

I do however have a question, I was thinking of adding a fail safe mechanism after the @while@ loop in case we delete all the files in the @input_res@ directory and we still are not under the threshold of 90% (most likely due to disk space being take up by other directories on the mounted disk), something like this:

if(totalDriveSpace >= ninetyPercentOfDriveSpace && files.isEmpty()){
   JobResults.deleteInputs(jobMetadata);
</code>

But I was also thinking this wouldn't be necessary because would the evaluation even begin if there was not enough space to begin that process? Let me know what you think :)

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-06-06T16:31:50Z

Arvin,

I'm about to look the rest of the code, but need to comment on this first:

@private static final String InputDataDirectory = "/mnt/wres_share/input_data";@

That should make use of a Java system property, I think, because you can't be guaranteed that @/mnt/wres_share@ exists wherever you install the tasker.

Let me read the rest of the comment,

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-06-06T16:34:57Z

Oh wait... Can't you just use the property @java.io.tmpdir@, since we are already using it to point to where input data is placed? This is in the tasker definition within our deployment YAML file:

@-Djava.io.tmpdir=/mnt/wres_share/input_data@

Or maybe I'm misinterpreting how you use that variable. I really need to read the rest of your comment, now. :)

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-06T16:39:11Z

Hank wrote:

Oh wait... Can't you just use the property @java.io.tmpdir@, since we are already using it to point to where input data is placed? This is in the tasker definition within our deployment YAML file:

@-Djava.io.tmpdir=/mnt/wres_share/input_data@

Or maybe I'm misinterpreting how you use that variable. I really need to read the rest of your comment, now. :)

Hank

I need to look into this more, but I believe it is essentially doing the same thing, if its just pointed to the directory location. Let me know what you think :D

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-06-06T16:42:10Z

Yes, there is already an inbuilt assumption that data is written to and read from the system temp dir, as shown in #108899-14, which is available as a system property.

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-06-06T16:43:01Z

Arvin wrote:

I need to look into this more, but I believe it is essentially doing the same thing, if its just pointed to the directory location. Let me know what you think :D

Right, but it could change. If we passed a different system property override for the @java.io.tmpdir@, your code would break.

epag commented 2 months ago

Original Redmine Comment Author Name: Evan (Evan) Original Date: 2023-06-06T16:43:25Z

Arvin wrote:

I need to look into this more, but I believe it is essentially doing the same thing, if its just pointed to the directory location. Let me know what you think :D

If it is doing the same thing that id prefer that you use the central definition in tempDir instead of defining it again in this file

epag commented 2 months ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-06T16:44:24Z

James wrote:

Arvin wrote:

I need to look into this more, but I believe it is essentially doing the same thing, if its just pointed to the directory location. Let me know what you think :D

Right, but it could change. If we passed a different system property override for the @java.io.tmpdir@, your code would break.

Absolutely, you are right, will change it to the system property!

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-06-06T16:53:38Z

First comment is about style. For consistency, we are following guidelines established years ago for Java coding in the OWP. I'll see if I can find the old documents explaining the style guidelines. Some points I do remember...

Be sure that static final constants are in all caps with underscores; i.e., @InputDataDirectory@ should be @INPUT_DATA_DIRECTORY@.
Make sure you put braces, {}, on lines by themselves. That promotes greater white spacing, which generally makes the code more readable.

(James: If I'm misremembering either of these, please speak up.) I'm open to changing the guidelines if folks disagree with them, but consistency is important, so, if we change them, we should retrofit the old code over time.

But I was also thinking this wouldn't be necessary because would the evaluation even begin if there was not enough space to begin that process?

The evaluation will be accepted by the tasker if the disk is at 90%. That leaves 200 GBs of room to work with, and nothing in the tasker (to the best of my knowledge) will prevent an evaluation with that much disk space to spare. I'm not even sure there is any limit short of 100% that will cause a failure.

So, the question is, if the disk usage is > 90%, should we allow users to post data? I think we can at that threshold. Again, that leaves 200 GBs which should be plenty of room nearly any posted-data evaluation. Perhaps the failsafe would kick in after 95% or 98%? Or perhaps we should just stick to 90% to leave plenty of buffer? Hmmm...

Regardless, if you are going to delete all inputs associated with the job metadata when the failsafe kicks in, be sure to return an error for the request. I guess that's probably obvious. :) You might also want to set the evaluation status to a failure (??? not sure about this, since a user could wait for disk to be freed up and then try to post the inputs).

Thanks,

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-06-06T16:59:13Z

Regarding code formatting, you should install @devUtils/wres_idea_format_settings.xml@ and use that (CTRL+ALT+L or @Code@ --> @Reformat Code@). Always use that before pushing.

Also, I strongly recommend that you install sonarlint or similar. This will catch all manor of small issues including, for example, the one that Hank mentioned on java naming conventions (upper case for constants). I would also recommend that you use javadoc liberally, even on private methods unless it is totally obvious what they are doing (by default, we don't generate docs for private methods, but we could, and it's the best way to document code anyway).

I have some other suggestions too, but I may wait until you're ready to push. :-)

epag commented 2 months ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-06T17:08:46Z

I have a tendency to always to do CTR + ALT + L because the formatting makes it look cleaner and the code I pasted into the ticket was formatted but I just realized I was using the default formatting schema which is what I am most use to, but I went ahead and changed it to the scheme found in the @devtools@ directory and I fixed the formatting, sorry about that!

epag commented 2 months ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-06T17:13:49Z

I will definitely add documentation/Java docs, but just wanted to make sure there was a level of approval before doing that. I also just installed SonarLint, thank you for the tip :)

epag commented 2 months ago

Original Redmine Comment Author Name: Evan (Evan) Original Date: 2023-06-06T17:30:31Z

Some input from me:

@if(JobResults.decideToDelete() >= 90.0){@

We are using a "Magic Number" here. Would prefer to have the method @decideToDelete()@ be something like @storageThresholdExceeded()@ that returns a boolean with a comment in the javadoc describing that we are choosing 90% as an arbitrary threshold.

@LOGGER.warn("Disk space has reached threshold... deleting input");@

LOGGER.info statements are persisted in the logs, I think this fits better as an info statement

                    long totalDriveSpace = JobResults.getTotalDriveSpace();
                    double ninetyPercentOfDriveSpace = totalDriveSpace * 0.90;
                    while(totalDriveSpace >= ninetyPercentOfDriveSpace && !files.isEmpty()){
                        Path fileToDelete = files.remove(0);
                        JobResults.deleteFile(fileToDelete);
                        totalDriveSpace = JobResults.getTotalDriveSpace();
                    }

Dislike the use of the hard coded number here as well and we don't need to store totalDriveSpace, I think we could just leverage the @storageThresholdExceeded()@ suggested above

Could end up looking like:

                    while(JobResults.storageThresholdExceeded() && !files.isEmpty()){
                        Path fileToDelete = files.remove(0);
                        JobResults.deleteFile(fileToDelete);
                    }

epag commented 2 months ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-07T12:14:26Z

            if ( resultValue == 0 )
            {
                sharedData.setJobState( JobMetadata.JobState.COMPLETED_REPORTED_SUCCESS );
                LOGGER.debug( "Shared metadata after setting job state: {}",
                              jobMetadata );
                JobResults.deleteInputs( jobMetadata );
            }
            else
            {
                sharedData.setJobState( JobMetadata.JobState.COMPLETED_REPORTED_FAILURE );
                LOGGER.debug( "Shared metadata after setting job state: {}",
                              jobMetadata );
                if ( JobResults.storageThresholdExceeded() )
                {
                    LOGGER.info( "Disk space has reached threshold... attempting to make space" );
                    List<Path> files = JobResults.getFilesInDirectory( System.getProperty( "java.io.tmpdir" ) );
                    JobResults.sortFilesByLastModified( files );
                    while ( JobResults.storageThresholdExceeded() && !files.isEmpty() )
                    {
                        Path fileToDelete = files.remove( 0 );
                        JobResults.deleteFile( fileToDelete );
                    }
                }
            }
            return resultValue;
        }
</code>

    private static boolean storageThresholdExceeded() throws IOException
    {
        Path path = Paths.get( System.getProperty( "java.io.tmpdir" ) );

        FileStore fileStore = Files.getFileStore( path );

        long totalSpace = fileStore.getTotalSpace();
        long usableSpace = fileStore.getUsableSpace();
        long usedSpace = totalSpace - usableSpace;

        double percentageUsed = ( ( ( double ) usedSpace / totalSpace ) * 100 );

        return percentageUsed >= 90.0;
    }

    private static List<Path> getFilesInDirectory( String path ) throws IOException
    {
        try ( Stream<Path> paths = Files.list( Paths.get( path ) ) )
        {
            return paths.collect( Collectors.toList() );
        }
    }

    private static void sortFilesByLastModified( List<Path> files )
    {
        Collections.sort( files, Comparator.comparingLong( path -> {
            try
            {
                return getFileLastModified( path );
            }
            catch ( IOException e )
            {
                throw new RuntimeException( e );
            }
        } ) );
    }

    private static long getFileLastModified( Path file ) throws IOException
    {
        BasicFileAttributes attributes = Files.readAttributes( file, BasicFileAttributes.class );
        return attributes.lastModifiedTime().toMillis();
    }

    private static void deleteFile( Path file ) throws IOException
    {
        Files.delete( file );
    }
</code>

Made optimizations based on Evan's suggestions as well as removed the instance variable and instead using system variables. I will now begin to add unit tests and complete Java Docs.

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-06-07T12:46:18Z

Thanks, Arvin.

A few very minor suggestions below for your consideration :-)

Do not catch a specific exception (@IOException@) and rethrow a more general one (@RuntimeException@), especially without annotating the context and especially not a @RuntimeException@, which is the most generic unchecked exception. In this case, you could use an @java.io.UncheckedIOException@, for example, but, either way, annotate the context with a message before rethrowing;
Abstract the storage threshold to a constant (static final member) as it's more transparent and easier to change if referenced more than once;
The @getFilesInDirectory@ is always passed the same path to the tmpdir. Rather than asking for the tmpdir in several places and creating a path in several contexts, create that path once and place in a constant (static final member);
Use @Path.of@, as @Paths.get@ is merely a facade;
One statement per line helps with debugging (e.g., npes where more than one thing in a chain is nullable). For example:
```
attributes.lastModifiedTime()
      .toMillis()
</code>
```

epag commented 2 months ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-08T17:34:35Z

I changed the testing version from JUnit 4 to JUnit for the Wres-tasker but I also allowed for backwards compatibility for the other test classes that are written in JUnit 4. James, I made the changes you suggested regarding sonarLint. I was wondering how should we go about the dicker file and making a system variable for the threshold to be more easily accessible and manipulated?

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-06-08T17:54:33Z

There are two files in the root dir:

@compose-entry.template.yml@
@compose-entry.yml@

They are essentially the same file, but the first one is the template and the second one is populated with the identities of the actual artifacts in production.

Inside these two files, you will see a @tasker@ service. Within the @tasker@ service you will see this line:

JAVA_OPTS=-Dwres.adminToken=${WRES_ADMIN_TOKEN} -Dwres.broker=broker -Dwres.redisHost=persister -Dwres.taskerPathToServerP12=${WRES_TASKER_SERVER_P12} -Dcom.redhat.fips=false -Djava.io.tmpdir=/mnt/wres_share/input_data -XX:HeapDumpPath=/mnt/wres_share/heap_dumps/tasker -XX:OnOutOfMemoryError='mv /mnt/wres_share/heap_dumps/tasker/java_pid%p.hprof /mnt/wres_share/heap_dumps/tasker/javapid%p$$CON_HOSTNAME.hprof; chmod 775 /mnt/wres_share/heap_dumps/tasker/javapid%p$$CON_HOSTNAME.hprof'

This is where we supply command line options to the jvm that is creating the tasker. You will already see a bunch of system property overrides within that line. For example:

@-Dwres.broker=broker@

This one sets a value @wres.broker@ to the name @broker@, which is the name of the broker that the tasker will attempt to connect with.

You can add a system property override there for the threshold. For example:

@-Dwres.dataDirectDiskThreshold=90@

You can then access this from within the tasker configuration, defaulting to the @DEFAULT_THRESHOLD@ or whatever you called it when you created that default.

For example:

    double diskThreshold = DEFAULT_THRESHOLD;
    String userThreshold = System.getProperty( "wres.dataDirectDiskThreshold" );
    if( Objects.nonNull( userThreshold ) )
    {
        try
        {
            double provisionalThreshold = Double.parseDouble( userThreshold );

            // Meets expectation?
            if ( provisionalThreshold > 0.0 && provisionalThreshold < 100.0 )
            {
                diskThreshold = provisionalThreshold;
                LOGGER.info( "Discovered a user-defined data direct disk threshold in the system property "
                             + "override, 'wres.dataDirectDiskThreshold'. The threshold is: {}.", diskThreshold );
            }
            // No, then warn
            else if ( LOGGER.isWarnEnabled() )
            {
                LOGGER.warn( "Discovered an invalid 'wres.dataDirectDiskThreshold'. Expected a number that is "
                             + "greater than 0 and less than 100, representing a percentage. but got: {}. Will "
                             + "use the default threshold instead: {}.",
                             provisionalThreshold,
                             DEFAULT_THRESHOLD );
            }
        }
        catch ( NumberFormatException e )
        {
            LOGGER.warn( "Discovered an invalid 'wres.dataDirectDiskThreshold'. Expected a number, but got: "
                         + " {}. Will use the default threshold instead: {}.",
                         userThreshold,
                         DEFAULT_THRESHOLD );
        }
    }
</code>

Something like that.

epag commented 2 months ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-08T18:23:11Z

Where would the Java code above go? Can that go in the @JobResults.Java@ class and use the template code above as a method of deciding whether to use the default threshold or the system property?

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-06-08T18:29:19Z

Yes, it would go nearby to where you're working and require the threshold.

epag commented 2 months ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-09T13:10:35Z

Code Changes as Follows: @compose-entry.yml@

         - JAVA_OPTS=-Dwres.adminToken=${WRES_ADMIN_TOKEN} -Dwres.broker=broker -Dwres.redisHost=persister -Dwres.taskerPathToServerP12=${WRES_TASKER_SERVER_P12} -Dcom.redhat.fips=false -Djava.io.tmpdir=/mnt/wres_share/input_data -Dwres.dataDirectDiskThreshold=90 -XX:HeapDumpPath=/mnt/wres_share/heap_dumps/tasker -XX:OnOutOfMemoryError='mv /mnt/wres_share/heap_dumps/tasker/java_pid%p.hprof /mnt/wres_share/heap_dumps/tasker/java_pid%p_$$CON_HOSTNAME.hprof; chmod 775 /mnt/wres_share/heap_dumps/tasker/java_pid%p_$$CON_HOSTNAME.hprof'
</code>

@compose-entry.template.yml@

         - JAVA_OPTS=-Dwres.adminToken=${WRES_ADMIN_TOKEN} -Dwres.broker=broker -Dwres.redisHost=persister -Dwres.taskerPathToServerP12=${WRES_TASKER_SERVER_P12} -Dcom.redhat.fips=false -Djava.io.tmpdir=/mnt/wres_share/input_data -Dwres.dataDirectDiskThreshold=90 -XX:HeapDumpPath=/mnt/wres_share/heap_dumps/tasker -XX:OnOutOfMemoryError='mv /mnt/wres_share/heap_dumps/tasker/java_pid%p.hprof /mnt/wres_share/heap_dumps/tasker/java_pid%p_$$CON_HOSTNAME.hprof; chmod 775 /mnt/wres_share/heap_dumps/tasker/java_pid%p_$$CON_HOSTNAME.hprof'
</code>

To both added @-Dwres.dataDirectDiskThreshold=90@

@JobResults.Java@

    public static double decideThreshold()
    {
        String userThreshold = System.getProperty( "wres.dataDirectDiskThreshold" );
        if ( Objects.nonNull( userThreshold ) )
        {
            try
            {
                double provisionalThreshold = Double.parseDouble( userThreshold );
                if ( provisionalThreshold > 0.0 && provisionalThreshold < 100.0 )
                {
                    LOGGER.info( "Discovered a user-defined data direct disk threshold in the system property "
                                 + "override, 'wres.dataDirectDiskThreshold'. The threshold is: {}.",
                                 provisionalThreshold );
                    return provisionalThreshold;
                }
                else if ( LOGGER.isWarnEnabled() )
                {
                    LOGGER.warn( "Discovered an invalid 'wres.dataDirectDiskThreshold'. Expected a number that is "
                                 + "greater than 0 and less than 100, representing a percentage. but got: {}. Will "
                                 + "use the default threshold instead: {}.",
                                 provisionalThreshold,
                                 DEFAULT_THRESHOLD_DISK_SPACE_THRESHOLD );
                }
            }
            catch ( NumberFormatException e )
            {
                LOGGER.warn( "Discovered an invalid 'wres.dataDirectDiskThreshold'. Expected a number, but got: "
                             + " {}. Will use the default threshold instead: {}.",
                             userThreshold,
                             DEFAULT_THRESHOLD_DISK_SPACE_THRESHOLD );
            }
        }
        return DEFAULT_THRESHOLD_DISK_SPACE_THRESHOLD;
    }
</code>

    private static boolean storageThresholdExceeded( double diskThreshold ) throws IOException
    {
        Path path = Path.of( TMP_DRIVE );

        FileStore fileStore = Files.getFileStore( path );

        long totalSpace = fileStore.getTotalSpace();
        long usableSpace = fileStore.getUsableSpace();
        long usedSpace = totalSpace - usableSpace;

        double percentageUsed = ( ( ( double ) usedSpace / totalSpace ) * 100 );

        return percentageUsed >= diskThreshold;
    }
</code>

                double diskThreshold = decideThreshold();
                if ( JobResults.storageThresholdExceeded( diskThreshold ) )
                {
                    LOGGER.info( "Disk space has reached threshold... attempting to make space" );
                    List<Path> files = JobResults.getFilesInDirectory( Path.of( TMP_DRIVE ) );
                    JobResults.sortFilesByLastModified( files );
                    while ( JobResults.storageThresholdExceeded( diskThreshold ) && !files.isEmpty() )
                    {
                        Path fileToDelete = files.remove( 0 );
                        JobResults.deleteFile( fileToDelete );
                    }
                }
</code>

Added a method on deciding the threshold whether to go with the default value (currently set to 90) or a value passed in by the system variable. Changed the @storageThresholdExceeded@ methods signature to take in a double once the threshold has been decided by the @decideThreshold()@ method. If all looks well, I am ready to begin testing :)

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-06-09T13:14:02Z

Looks good to me.

epag commented 2 months ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-09T15:32:02Z

I am going to push my changes now, but I wanted to know the format for the commit message, would it just be:

@#108899 {My Commit Message}@?

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-06-09T15:36:36Z

I usually do, "[blah message]; refs #108899". However, as long as you say "refs #108899" in it, it will make the connection to this ticket.

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-06-09T15:42:39Z

Yeah, the hook is generally @Refs #ticket_no@ or @Fixes #ticket_no@. edit: but the former is generally preferred unless you're confident it is fixed (and it will then auto-resolve the ticket).

In theory, there is "best practice" for git commit messages. From memory, it needs to be written in the imperative tone (in other words, as though you are telling git what to do to fix the software) and the first line should be no more than 50 characters and the subsequent lines no more than 70 or something like that. The first line is meant to be a summary.

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-06-09T15:44:55Z

( As an aside, with gerrit, you'll get a warning on commit if the lines contain more than the expected number of characters, so that is nice or annoying depending on your perspective :-) )

epag commented 2 months ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-09T15:51:31Z

James quick question, with the addition we added to the YMAL files, where could this value be dynamically changed (The threshold %)?

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-06-09T15:55:56Z

In this context, "dynamically" means when the service is started. Once it is started, you cannot change it. Or, rather, you would need to cycle the tasker to change it.

But, given that definition, you would simply change the threshold within the yaml from 90 to something else, say 50. When the service starts, it follows the declaration in that yaml. I think Hank is going to help you with deployment shortly, so he will walk you through what it means to start the service.

epag commented 2 months ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-09T16:04:01Z

Thank you James! Yes, that is what I meant by dynamically (being able to change it without doing it from the code side), just didn't know the correct word for it.

So for clarification, to change the threshold value you need to go on the repository to the 2 YAML files where we added the sys variable and change it there correct?

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-06-09T16:09:48Z

No, Hank will go through a deployment with you. When you deploy, you will have the yaml locally to the machine where it's being deployed and you can edit it directly. Basically, that yaml is input to a @docker-compose@ command, which starts the service. No need to push anything when testing, although we ultimately do push the yaml exactly as it was last deployed in production. More here:

https://vlab.noaa.gov/redmine/projects/wres/wiki/Operational_And_Maintenance_Instructions_(START_STOP_INSTALL)_for_the_Central_OWP_WRES

NOAA-OWP / wres

As a developer, I don't want posted data to be removed after evaluation completes if that evaluation failed #92