davidobrien1985 / davidobrien1985.github.io


PS v python #13

Open davidobrien1985 opened 4 years ago

martin9700 commented 4 years ago

I'm making assumptions here, but I think they're pretty solid. You have an array ($Records) that is a 1-to-1 representation of your file structure, so 130,000 records. You are then filtering out 129,999 records 130,000 times (once for each file in the loop). This is really inefficient.

I made this small change which should significantly increase your performance:

$sourceloc = './dataset'
$targetloc = './target'
# Build a hashtable keyed on FileName so each lookup is O(1) instead of a full scan of $records
$records = Import-Csv train_labels.csv | Group-Object -Property FileName -AsHashTable
$files = Get-ChildItem -Path $sourceloc
foreach ($file in $files) {
  #$data = $records | Where-Object FileName -eq $file.Name
  $destinationfolder = $targetloc + '\' + $records[$file.Name].scientific_name
  $sourcefile = $sourceloc + '\' + $file.Name
  Copy-Item -Path $sourcefile -Destination $destinationfolder -Force
}

As you've seen, Get-ChildItem is notoriously slow. If this weren't a one-off operation, I'd look at using DIR or Robocopy to get this information much more quickly. If I can link-build a little, see here.
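For a rough idea of the Robocopy route (Windows-only; 'C:\robocopy_dummy' is just a placeholder destination that the list-only switch never writes to):

# List-only mode (/L) prints file names without copying or touching anything;
# the remaining switches strip headers, directory lines, classes and sizes from the output.
$names = robocopy $sourceloc 'C:\robocopy_dummy' /L /NJH /NJS /NDL /NC /NS /NP |
  ForEach-Object { $_.Trim() } |
  Where-Object { $_ }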

jkavanagh58 commented 4 years ago

Not arguing right tool for the right job. I too have been working in PowerShell since v1; one thing I learned after many years is how correctly casting variables can increase performance, for example using [System.IO.DirectoryInfo] for your loc variables.
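As a small illustration of that suggestion (reusing the $sourceloc/$targetloc names from the snippet above):

# Strongly typed paths: the cast produces DirectoryInfo objects up front.
# Note that .NET resolves relative paths against the process working directory,
# not PowerShell's $PWD, so absolute paths are safer here.
[System.IO.DirectoryInfo]$sourceloc = './dataset'
[System.IO.DirectoryInfo]$targetloc = './target'
# DirectoryInfo exposes the file list directly, without going through Get-ChildItem.
$files = $sourceloc.GetFiles()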

anhlqn commented 4 years ago

Get-ChildItem is certainly slow with your 130,000 files because, I believe, it has to go to each file to retrieve its properties. I used to manipulate millions of small files and, depending on the structure, it's much faster to use the DIR batch command to list files.
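As a rough sketch of that idea (assuming a flat source folder like the one above):

# dir /b prints bare file names without per-file property lookups,
$names = cmd /c dir /b $sourceloc
# and the .NET equivalent streams full paths lazily, without cmdlet overhead.
$paths = [System.IO.Directory]::EnumerateFiles($sourceloc)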

Someone also measured and compared the performance of foreach at https://mcpmag.com/articles/2016/07/06/powershell-code-for-performance.aspx

Jeff-Jerousek commented 4 years ago

I would also suggest using the newer .Where() and .ForEach() methods.

$files.ForEach({
  $file = $_   # capture the current file so it is still available inside .Where()
  $data = $records.Where({ $_.FileName -eq $file.Name })
  $destinationfolder = $targetloc + '\' + $data.scientific_name
  $sourcefile = $sourceloc + '\' + $file.Name
  Copy-Item -Path $sourcefile -Destination $destinationfolder -Force
})

netsec4u commented 4 years ago

In addition to Get-ChildItem being slow, Import-Csv and Copy-Item are slow as well. These are great for quick one-liner ad hoc tasks; however, for large tasks where efficiency is necessary, directly leveraging .NET objects and methods is the way to go.

Everyone agrees with the statement about the right tool for the job, but you must also choose the best way to use that tool for the job.
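For illustration, a rough sketch of what the all-.NET route could look like. The column order (FileName first, scientific_name second) is an assumption, and the snippet expects $sourceloc and $targetloc to be absolute paths, since .NET resolves relative paths against the process working directory rather than $PWD:

# Stream the CSV line by line instead of materialising 130,000 objects with Import-Csv.
$first = $true
foreach ($line in [System.IO.File]::ReadLines('train_labels.csv')) {
  if ($first) { $first = $false; continue }   # skip the header row
  $fileName, $scientificName = $line.Split(',')[0, 1]   # assumed column order
  $destinationFolder = Join-Path $targetloc $scientificName
  [System.IO.Directory]::CreateDirectory($destinationFolder) | Out-Null   # no-op if it already exists
  # File.Copy avoids the provider overhead of Copy-Item; $true allows overwriting.
  [System.IO.File]::Copy((Join-Path $sourceloc $fileName), (Join-Path $destinationFolder $fileName), $true)
}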

ghost commented 4 years ago

Aren't you copying in posh and moving in Python?

aperally commented 4 years ago

Which PS version have you been using for this? PS 'Core' is much more performant.

MertSenel commented 4 years ago

@davidobrien1985 one thing you could do is multi-thread the script as well. Your source is static and you are not recursing, so I assume all of your source files are in a flat hierarchy.

Get-ChildItem is slow, but even then, your loop puts the entire load on a single CPU, as PowerShell is single-threaded by nature.

What you could do is, once you have the list of files, divide it into as many chunks as the host has CPUs (this number can be retrieved via cmdlets as well) and then process each chunk as a 'thread job'.

https://docs.microsoft.com/en-us/powershell/module/threadjob/start-threadjob?view=powershell-7

It's fairly simple to use, and since there is no race condition on your target, it is safe to implement as well.

This would make use of all of your (v)CPU cores and hence automatically increase performance.
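A minimal sketch of that chunk-per-core idea, reusing $files, the $records hashtable, $sourceloc and $targetloc from martin9700's snippet above (Start-ThreadJob ships with PowerShell 7, or via the ThreadJob module on 5.1):

$coreCount = [Environment]::ProcessorCount
$chunkSize = [math]::Ceiling($files.Count / $coreCount)
$jobs = for ($i = 0; $i -lt $files.Count; $i += $chunkSize) {
  # Slice the file list into one chunk per core and hand each chunk to a thread job.
  $chunk = $files[$i..([math]::Min($i + $chunkSize, $files.Count) - 1)]
  Start-ThreadJob -ScriptBlock {
    param($chunk, $records, $sourceloc, $targetloc)
    foreach ($file in $chunk) {
      $destinationfolder = Join-Path $targetloc $records[$file.Name].scientific_name
      Copy-Item -Path (Join-Path $sourceloc $file.Name) -Destination $destinationfolder -Force
    }
  } -ArgumentList $chunk, $records, $sourceloc, $targetloc
}
# Wait for all chunks to finish before moving on.
$jobs | Wait-Job | Receive-Job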