Closed Johnsonzcode closed 1 year ago
@Johnsonzcode again, without access to the data to try the code myself it is quite tricky to debug the code. I have tried to implement a quick fix, try to copy the newer version of 09C-DetectDuplicateContigs
in the bin folder of the repository. Place it into the bin folder, make it executable with chmod a+x bin/09C-DetectDuplicateContigs
and resume the analysis.
But I can't see your email. It is been hiddened. The packed graph five is 9.1Gb. How could I send to you?
Yeah, sorry I just noticed it (I had limited internet connection yesterday). My email address is andrea.talenti@ed.ac.uk and you can find it here, searching for Andrea Talenti on the email page of the University of Edinburgh. You can share it using services like WeTransfer, dropbox, onedrive or google drive. Anyway, if the alignment.blasttab
is not empty, then the graph is now compliant with the analyses, and the problem lies with the R script, whose dependencies can be tricky to coordinate in an anaconda environment (especially if you have another local installation of R). In this case, you can just share the alignments.blasttab
and contigs.txt
and I can try to figure out what is going on.
I filtered out small size scaffolds without collinearity to other genome's chromosome. So there is no contigs in my graph. I faked a empty contigs.txt
. The alignment.blasttab
has been send via email.
alignment.blasttab
is 32Mb. Email has been rejected. So I compress it with Bandzip
. You should remove suffix .txt
and decompress it.
alignments.blasttab.7z.txt
@Johnsonzcode I forgot the contigs.txt is a dummy file. I'll try to implement a fix and come back to you when ready. Thanks for the file.
Thank you for your great help!
@Johnsonzcode apologies I confused the inputs. Could you share with me the fai file present in the same folder?
Sorry for late. Of cource. candidate.fa.fai.txt
@RenzoTale88 Could I ask the situation?
Working on it
@Johnsonzcode I could not reproduce the bug you came across, but fix other to make the code compliant with recent versions of R. I've uploaded a new version of 09C-DetectDuplicateContigs in case you want to try it. However, keep in mind I did not come across the same issue, so I can't be sure I fixed this
OK, I will try to find the issue by myself. I think it is R packages issue from the log.
@Johnsonzcode if you have R installed in the same system you run nextflow, you can try matching the version of R in environment.yml
with the one you have installed in your machine. I noticed sometimes using R in anaconda in nextflow causes some bizarre issues, though quite unpredictable and therefore difficult to fix.
@Johnsonzcode I could not reproduce the bug you came across, but fix other to make the code compliant with recent versions of R. I've uploaded a new version of 09C-DetectDuplicateContigs in case you want to try it. However, keep in mind I did not come across the same issue, so I can't be sure I fixed this
Do you think it is the R packages version issue?
If I try the pipeline by the new version of 09C-DetectDuplicateContigs
, Is it still the error?
Difficult to tell. I would recommend the new script because it introduces some new compatibility fixes. If the problem persists, I would try the R version solution.
From: johnsonz @.> Sent: Wednesday, January 11, 2023 10:35:21 AM To: evotools/CattleGraphGenomePaper @.> Cc: RenzoTale88 @.>; Mention @.> Subject: Re: [evotools/CattleGraphGenomePaper] Help about non-reference sequence detection (Issue #1)
@Johnsonzcodehttps://github.com/Johnsonzcode I could not reproduce the bug you came across, but fix other to make the code compliant with recent versions of R. I've uploaded a new version of 09C-DetectDuplicateContigs in case you want to try it. However, keep in mind I did not come across the same issue, so I can't be sure I fixed this
Do you think it is the R packages version issue? If I try the pipeline by the new version of 09C-DetectDuplicateContigs, Is it still the error?
— Reply to this email directly, view it on GitHubhttps://github.com/evotools/CattleGraphGenomePaper/issues/1#issuecomment-1378471930, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AFRTPKAI6SMCQNGBK3BIOSTWRZ5FTANCNFSM6AAAAAATHRV4AE. You are receiving this because you were mentioned.Message ID: @.***>
Than you for your recommandation.
@Johnsonzcode did you manage to run the workflow?
@RenzoTale88, Sorry for long time no reply. I am trying to run. But the situation is like before. No progress.
Dear @RenzoTale88
I am trying to add variants to the graph. After combining chromosome level graph with vg combine
.
One individual vcf file was selected to add for testing.
vg combine CHR*/rawCHR*.vg > merged.vg
vg add -v one_sample.vcf -t 32 > var_add_merged.vcf
But it turns out that contig breed1.CHR1 not in graph. I got a series of questions.
For short, maybe like this way:
### pesudo code
for breed in dir:
do
SURVIVOR merge samples_in_breed.txt 100 1 1 0 0 50 > SV50_100BND.vcf
done
SURVIVOR merge all_breed_vcf.txt 100 1 1 0 0 50 > SV50_100BND_all_breed.vcf
### for chromosome vg:
for i in seq(1 N); do
hal2vg --noAncestors --refGenome myref chr${i}.hal > tmp${i}.vg
vg mod -X 32 tmp${i}.vg > tmp${i}.chop32.vg
vg explode tmp${i}.chop32.vg exploded/
vg prune -r chr${i}.final.vg > chr${i}.final.pruned.vg
done
vg index -x all.xg {i}.final.vg ; done)
vg index -g all.gcsa (seq 1 N); do echo chr${i}.final.pruned.vg; done)
# then the SV50_100BND_all_breed.vcf should be split into SV50_100BND_all_breed_chr*.vcf by chromosome.
vcftools --vcf --chr * SV50_100BND_all_breed.vcf > SV50_100BND_all_breed_chr*.vcf
# then GraphVCF.py and AddToGraph.sh should be used chromosome by chromosome.
Am I right?
All the best Johnson
@Johnsonzcode I'll try to run some more tests on the workflow whenever I get time. If I manage to reproduce the error you come across I'll come back to you. Regarding the use of VG I recommend asking the VG developer directly: they will be able to provide better support for your use case, especially on the most recent versions of VG. Last time I added variants to the graph was 23 releases ago!
@RenzoTale88 Many thanks! Okay, I will ask for VG team. But is the workflow above right to add variants for the history release?
If you want to extract SVs from a cactus graph you might as well try to use the minigraph-cactus pangenome pipeline, which is more designed for this. If you want to reproduce what we have done, have a look at the scripts here, starting with submit.sh
to see the order we ran them.
Dear @RenzoTale88
What I am tring to do is that, in addition to the six assemblies of chromosome level used to construct the graph-based pan-genome, I also want to add the structural variation at the population level to the graph-based pan-genome, as described in your article.
From your scripts here. But no mention is made of how structural variation is added to the grpah-based pangenome.
AddToGraph.sh
seems to do that. I am not sure what these lines do:
graph=`head -n $SGE_TASK_ID $gv | tail -1`
if [ ! -z $vc ]; then
vcf=`head -n $SGE_TASK_ID $vc | tail -1`
spp=$sn
if [ ! -e $vcf".tbi" ]; then tabix -p vcf $vcf; fi
fi
gName=`basename -s ".vg" $graph`
export TMPDIR=`pwd`
echo $vcf
echo $spp
echo $graph
if [ ! -z $vcf ] && [ ! -z $spp ] ; then
echo $spp
echo "${vg}/vg add -n ${SGE_TASK_ID}=${spp}.${SGE_TASK_ID} -v $vcf $graph > GRAPH/${gName}.tmp.vg"
vg mod -X 32 $graph | vg ids -s - | vg add -n ${SGE_TASK_ID}=${spp}.${SGE_TASK_ID} -v $vcf > GRAPH/${gName}.tmp.vg
vg ids -s GRAPH/${gName}.tmp.vg > GRAPH/${gName}.final.vg && rm GRAPH/${gName}.tmp.vg
else
vg mod -X 32 $graph | vg ids -s - > GRAPH/${gName}.final.vg
fi
Especially, ${SGE_TASK_ID}=${spp}.${SGE_TASK_ID}
, is that says you are adding variants to graph-based by chromosome?
All the best Johnson
In that particular script we add the variants from a vcf file using VG add on a chromosome by chromosome basis, providing the backbone genome as ${spp}
and the chromosome number as the SGE variable SGE_TASK_ID. The option you mention is to link the chromosome to the right species, since the VCF is from a linear genome study.
Again, you should really ask the VG team for help in this. We're not developing the tool, and are in no position to offer you guidance over how to perform your analyses, especially after 23 releases of the tools we used.
@RenzoTale88 Thank you very much for your kind,You've helped me so much!
Dear @npch @prenderj @RenzoTale88 @prasundutta87 Your code shows us a clear way to constrcut a pangenome. When I came to detect non-reference sequence, I can't find a help documentation. I don't understand the parameters in
CattleGraphGenomePaper/detectSequences/nf-GraphSeq/conf/params.config
. Could you please write a help documemtation for us or explain the parameters for us ?Thank you in adcance!
Best Johnson