Closed DustinSokolowski closed 2 years ago
It probably never got it in the first place.
I would kill that job (make sure you keep track of which split it’s running on).
Then use the awk script below to remove dups on just that split's file. Then concatenate all the merged_nodups, dups, and opt_dups files (via cat) into the three big files, and relaunch juicer at -S final.
# Usage:
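# Helper kept from the original script; it is not used below.
function abs(v) {
    return v<0?-v:v;
}
BEGIN {
    # "name" is the output filename prefix; it is expected to be set by the caller (e.g. awk -v name=...).
    dupname=name"dups.txt";
    nodupname=name"merged_nodups.txt";
}
{
    # Input must be sorted: a line counts as a duplicate only if its
    # first eight fields all match those of the previous line.
    if ($1!=p1 || $2!=p2 || $3!=p3 || $4!=p4 || $5!=p5 || $6!=p6 || $7!=p7 || $8!=p8) {
        print > nodupname
    }
    else {
        print > dupname
    }
}
{
    # Remember this line's fields for comparison with the next line.
    p1=$1;p2=$2;p3=$3;p4=$4;p5=$5;p6=$6;p7=$7;p8=$8
}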
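A minimal sketch of how the script above might be run on a single split. It assumes the script is saved as rmdups.awk and that the split's sorted input is a file called demo_msplit0001_; both names, and the three-line input, are made up for illustration. The name variable is passed with awk's -v option so the outputs get that prefix.

```shell
#!/bin/sh
# Sketch only: file names and sample data are hypothetical.
set -e
workdir=$(mktemp -d)
cd "$workdir"

# Save a copy of the dedup logic from the reply above.
cat > rmdups.awk <<'EOF'
BEGIN {
    dupname = name "dups.txt"
    nodupname = name "merged_nodups.txt"
}
{
    if ($1!=p1 || $2!=p2 || $3!=p3 || $4!=p4 || $5!=p5 || $6!=p6 || $7!=p7 || $8!=p8) {
        print > nodupname
    } else {
        print > dupname
    }
    p1=$1; p2=$2; p3=$3; p4=$4; p5=$5; p6=$6; p7=$7; p8=$8
}
EOF

# Made-up, already sorted input: line 2 repeats the first eight fields of line 1.
cat > demo_msplit0001_ <<'EOF'
0 chr1 100 0 16 chr1 500 1 readA
0 chr1 100 0 16 chr1 500 1 readB
0 chr1 200 0 16 chr1 900 2 readC
EOF

# -v name=... sets the output prefix used in the BEGIN block.
awk -v name=demo_msplit0001_ -f rmdups.awk demo_msplit0001_

# Report how many lines went to each output file.
wc -l demo_msplit0001_merged_nodups.txt demo_msplit0001_dups.txt
```

Because the comparison is only against the previous line, the input has to be sorted already, which is the case for the splits of merged_sort.txt.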
On Fri, Mar 19, 2021 at 1:58 PM Dustin Sokolowski @.***> wrote:
Thank you for this great tool.
I am running the juicer.sh pipeline using the PBS version and I am on the deduplication step. It submits about 100 jobs, and 99 of them finish in ~10 hours. One final job, however, has been running for 130+ hours and shows no signs of stopping. I looked for a previous issue describing this problem, and the closest I found suggested adding a "j" flag (#169 https://github.com/aidenlab/juicer/issues/169). When I tried to reproduce this, I saw that -j is now an illegal flag. Is it possible that this flag was not carried over to the PBS version (or is no longer needed)? Otherwise, is there a way to speed up the deduplication step, or to cancel this single job and let the remaining 99+% of the Hi-C data deduplicate and align?
Thanks so much, Dustin
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/aidenlab/juicer/issues/214, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAK2EW3TZID7ORYRVI7DUD3TEOGE3ANCNFSM4ZPH5TEQ .
-- Neva Cherniavsky Durand, Ph.D. | she, her, hers Assistant Professor | Molecular and Human Genetics Aiden Lab | Baylor College of Medicine www.aidenlab.org
Hey!
Thank you so much for the quick reply. I'll let you know how it works.
Best! Dustin
Hey!
I'm sorry to message twice. I noticed that at this stage I have _dups.txt and _nodups.txt files but no opt_dups.txt files yet.
Looking into "split_rmdups.awk", it looks like the opt_dups.txt files should come at the final step of deduplication, after _dups.txt and _nodups.txt are made? Since _dups.txt and _nodups.txt didn't get the "all ok" (because one split got stuck), the opt_dups files couldn't be generated.
Do you know if there is any work-around for this?
No, they should be created at the same time. They might not be there if the program is unable to determine opt dups (depends on readname).
I would concat the dups and merged_nodups and count the lines and see if it adds up to the number of lines in merged_sort (or you could also try ls -l on all 3 and see if it adds up)
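The line-count check suggested above can be sketched as follows. It assumes the concatenated files are named merged_nodups.txt and dups.txt and sit next to merged_sort.txt; a throwaway directory with made-up contents stands in for the real data here.

```shell
#!/bin/sh
# Sketch only: tiny demo files stand in for the real (very large) ones.
set -e
outputdir=$(mktemp -d)
printf 'a\nb\nb\nc\n' > "$outputdir/merged_sort.txt"
printf 'a\nb\nc\n'    > "$outputdir/merged_nodups.txt"
printf 'b\n'          > "$outputdir/dups.txt"

# Every line of merged_sort.txt should end up in exactly one output file.
# (If an opt_dups.txt exists, it would need to be added to the second command.)
sorted_lines=$(awk 'END {print NR}' "$outputdir/merged_sort.txt")
split_lines=$(awk 'END {print NR}' "$outputdir/merged_nodups.txt" "$outputdir/dups.txt")

if [ "$sorted_lines" -eq "$split_lines" ]; then
    status="line counts match"
else
    status="line counts differ"
fi
echo "$status: $sorted_lines vs $split_lines"
```

Counting lines takes a full read of each file, so on inputs of this size the ls -l byte comparison mentioned above is the quicker first check.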
Hey!
Thank you for your prompt reply.
It looks like the combination of _dups.txt and _nodups.txt is close to the size of merged_sort (99.2% of the size).
Starting from the "align" directory:
ls -l *dups.txt > dups_size.txt
cat dups_size.txt | awk '{ print $5 }' | paste -sd+ | bc
160217286868
ls -l merged_sort.txt
161559666042
For a given split, the dups and nodups files are not the same size:
1993809 Mar 13 17:07 C68502_msplit0450_dups.txt
103420185 Mar 13 17:07 C68502_msplit0450_merged_nodups.txt
Lastly I've attached the first 1000 lines of merged_sort.txt to see if it's lacking the necessary readname.
If we do not need the opt files can we proceed without them?
Best! Dustin
merged_sort1k.txt https://github.com/aidenlab/juicer/files/6173474/merged_sort1k.txt
You don’t need to worry about opt dups.
Did you already cat to create merged_nodups? Here is what I’m suggesting you verify (set the outputdir var to aligned for example)
total=1
total2=0
total=$(ls -l ${outputdir}/merged_sort.txt | awk '{print $5}')
total2=$(ls -l ${outputdir}/merged_nodups.txt ${outputdir}/dups.txt ${outputdir}/opt_dups.txt | awk '{sum = sum + $5}END{print sum}')
if [ -z "$total" ] || [ -z "$total2" ] || [ "$total" -ne "$total2" ]
then
    echo "***! Error! The sorted file and dups/no dups files do not add up, or were empty. Merge or dedupping likely failed, restart pipeline with -S merge or -S dedup"
    echo "Dups don't add up. Check ${outputdir} for results"
    exit 1
fi
Hey!
Thanks again. I ran your script and it looks like they do not add up.
The sum of dups.txt and merged_nodups.txt is 99% of the size of merged_sort.txt.
With that in mind, is it possible that the single split is working its way through really slowly for some reason? I tried to restart this a couple of times and the same split slowed down to a crawl each time.
Best, Dustin
So I would stop that split (the slow one) and run the rmdups.awk script I sent, replacing the corresponding msplit_merged_nodups and msplit_dups files.
Alternatively, you could just run the rmdups.awk script directly on merged_sort. It might not take too long, depending on how big merged_sort is.
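Once the slow split has been re-processed, the per-split outputs still have to be concatenated before relaunching. A rough sketch, assuming the split files follow the PREFIX_msplitNNNN_dups.txt and PREFIX_msplitNNNN_merged_nodups.txt naming seen earlier in the thread; the demo files and their contents are made up.

```shell
#!/bin/sh
# Sketch only: demo files stand in for the real per-split outputs.
set -e
aligned=$(mktemp -d)
cd "$aligned"
printf 'n1\n' > demo_msplit0001_merged_nodups.txt
printf 'n2\n' > demo_msplit0002_merged_nodups.txt
printf 'd1\n' > demo_msplit0001_dups.txt
printf 'd2\n' > demo_msplit0002_dups.txt

# The split numbers are zero-padded, so the glob expands in numeric order
# and the concatenated file keeps the order of the splits.
# Matching exactly four digits before the suffix keeps other files
# that merely end in "dups.txt" out of each pattern.
cat ./*_msplit[0-9][0-9][0-9][0-9]_merged_nodups.txt > merged_nodups.txt
cat ./*_msplit[0-9][0-9][0-9][0-9]_dups.txt > dups.txt

# Show the sizes of the two concatenated files.
ls -l merged_nodups.txt dups.txt
```

After this, the size check against merged_sort.txt can be repeated before relaunching juicer at -S final, as suggested earlier in the thread.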
Thanks so much, I'll give it a shot!
Hey
Thank you again! I think that it worked!
I have one last question if that's alright. In the subsequent steps I got an error saying that there are no reads in the Hi-C contact matrices "java.lang.RuntimeException: No reads in Hi-C contact matrices. This could be because the MAPQ filter is set too high (-q) or because all reads map to the same fragment."
Is this section required to make the .hic file for genome assembly? Furthermore, since I set -s "none", should this step be run at all? Lastly, is there a way of checking whether the .hic file was properly made, or should I just continue to 3D-DNA and check the output there?
Thanks! Dustin
If you’re using 3D DNA you can stop once the merged nodups is created.